ukbaz ukbaz - 1 month ago 7
Python Question

reading a log with dashed lines into a pandas dataframe

I have a tricky log file that I wish to to get into a clean DF. The format of the log is the following ;

===============================================================================
2016/03/28 12:26:45 - Message
-------------------------------------------------------------------------------
2016/03/28 12:26:45 - Message
2016/03/28 12:26:45 - Message
Message
2016/03/28 12:26:45 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:28:30 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
-------------------------------------------------------------------------------
2016/03/28 12:28:40 - Message
===============================================================================


The log continues in the above pattern, my goal is to have the following dataframe;

Time Text
2016/03/28 12:26:45 Message


I have tired to parse the file file on the '-' and create a Dataframe, and remove the dashed lines.

import pandas as pd
from pandas.compat import StringIO

clean = open(filename).read().remove('-------------------------------------------------------------------------------', '')
clean2 = open(filename).read().replace('===============================================================================', '')
df = pd.read_csv(filename, sep = "\s*\-", names = ["Time", "Text"], engine = "python")
df.Time = pd.to_datetime(df.Time, format='%d/%m/%y %H:%M:%S.%f')
df.Text = df.Text


However I get alot of NaN columns, any help is appreciated

Answer

I think you can use to_datetime with errors='coerce' for replace bad data to NaT with dropna for remove all rows where NaT in column Time:

import pandas as pd
from pandas.compat import StringIO

temp=u"""===============================================================================
2016/03/28 12:26:45 - Message
-------------------------------------------------------------------------------
2016/03/28 12:26:45 - Message
2016/03/28 12:26:45 - Message
Message
2016/03/28 12:26:45 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:28:30 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
-------------------------------------------------------------------------------
2016/03/28 12:28:40 - Message
==============================================================================="""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), sep="\s+-\s+", names = ["Time", "Text"], engine = "python")

df.Time = pd.to_datetime(df.Time, errors='coerce') 
df.dropna(subset=['Time'], inplace=True)  
print (df)
                  Time     Text
1  2016-03-28 12:26:45  Message
3  2016-03-28 12:26:45  Message
4  2016-03-28 12:26:45  Message
6  2016-03-28 12:26:45  Message
7  2016-03-28 12:26:46  Message
8  2016-03-28 12:26:46  Message
9  2016-03-28 12:28:30  Message
10 2016-03-28 12:28:40  Message
11 2016-03-28 12:28:40  Message
12 2016-03-28 12:28:40  Message
14 2016-03-28 12:28:40  Message