Peter Whitehead Peter Whitehead - 2 months ago 5
Python Question

How to create rows and columns in a .csv file from .log file

I am trying to parse a .log file from MTurk in to a .csv file with rows and columns using Python. My data looks like:

P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;

Currently, I have this, which puts everything in to columns:

import pandas as pd

from pandas import read_table

log_file = '3BF51CHDTWYBE3LE8DZRA0R5AFGH0H.log'

df = read_table(log_file, sep=';|,', header=None, engine='python')

Like this:

P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|P|15605|#E5DC22|800|9|R|16108

However, I cannot seem to be able to break this in to multiple rows, so that it would look more like this:

P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|
|P|15605|#E5DC22|800|9|R|16108

ie. where all the "P"s would be in one column, where all the colors would be in another, the "r"s, etc..

Any help is much appreciated. Thank you!


Answer

You can use

In [16]: df = pd.read_csv('log.txt', lineterminator=';', sep=':', header=None)

to read the file (say, 'log.txt') assuming that the lines are terminated by ';', and the separators within lines are ':'.

Unfortunately, your second column will now contain commas, which you'd like to logically separate. You can split the commas along the lines, and concatenate the result to the first column:

In [17]: pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1)
Out[17]: 
       0      1        2    3    4        5
0      P  14142    GREEN  800    9      NaN
1      R  14597        7    y  NaN  Correct
2      P  15605  #E5DC22  800    9      NaN
3      R  16108        7    f  NaN  Correct
4      P  17115    GREEN  100    9      NaN
5      R  17548        7    y  NaN  Correct
6      P  18552  #E5DC22  100    9      NaN
7      R  18972        7    f  NaN  Correct
8      P  19979    GREEN  800    9      NaN
9      R  20379        7    y  NaN  Correct
10     P  21387  #E5DC22  800    9      NaN
11     R  21733        7    f  NaN  Correct
12     P  22740      RED  100    9      NaN
13     R  23139        7    y  NaN    False
14     P  24147     BLUE  100    9      NaN
15     R  24547        7    f  NaN    False
16     P  25555      RED  800    9      NaN
17     R  26043        7    b  NaN  Correct
18     P  27051     BLUE  800    9      NaN
19  \n\n    NaN      NaN  NaN  NaN      NaN
Comments