Peter Whitehead - 6 months ago 16

Python Question

I am trying to parse a .log file from MTurk in to a .csv file with rows and columns using Python. My data looks like:

P:,14142,GREEN,800,9;R:,14597,7,y,NaN,Correct;P:,15605,#E5DC22,800,9;R:,16108,7,f,NaN,Correct;P:,17115,GREEN,100,9;R:,17548,7,y,NaN,Correct;P:,18552,#E5DC22,100,9;R:,18972,7,f,NaN,Correct;P:,19979,GREEN,800,9;R:,20379,7,y,NaN,Correct;P:,21387,#E5DC22,800,9;R:,21733,7,f,NaN,Correct;P:,22740,RED,100,9;R:,23139,7,y,NaN,False;P:,24147,BLUE,100,9;R:,24547,7,f,NaN,False;P:,25555,RED,800,9;R:,26043,7,b,NaN,Correct;P:,27051,BLUE,800,9;

Currently, I have this, which puts everything in to columns:

import pandas as pd

from pandas import read_table

log_file = '3BF51CHDTWYBE3LE8DZRA0R5AFGH0H.log'

df = read_table(log_file, sep=';|,', header=None, engine='python')

Like this:

P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|P|15605|#E5DC22|800|9|R|16108

However, I cannot seem to be able to break this in to multiple rows, so that it would look more like this:

P|14142|GREEN|800|9|R|14597|7|y|NaN|Correct|

|P|15605|#E5DC22|800|9|R|16108

ie. where all the "P"s would be in one column, where all the colors would be in another, the "r"s, etc..

Any help is much appreciated. Thank you!

Answer

You can use

```
In [16]: df = pd.read_csv('log.txt', lineterminator=';', sep=':', header=None)
```

to read the file (say, `'log.txt'`

) assuming that the lines are terminated by `';'`

, and the separators within lines are `':'`

.

Unfortunately, your second column will now contain commas, which you'd like to logically separate. You can split the commas along the lines, and concatenate the result to the first column:

```
In [17]: pd.concat([df[[0]], df[1].str.split(',').apply(pd.Series).iloc[:, 1: 6]], axis=1)
Out[17]:
0 1 2 3 4 5
0 P 14142 GREEN 800 9 NaN
1 R 14597 7 y NaN Correct
2 P 15605 #E5DC22 800 9 NaN
3 R 16108 7 f NaN Correct
4 P 17115 GREEN 100 9 NaN
5 R 17548 7 y NaN Correct
6 P 18552 #E5DC22 100 9 NaN
7 R 18972 7 f NaN Correct
8 P 19979 GREEN 800 9 NaN
9 R 20379 7 y NaN Correct
10 P 21387 #E5DC22 800 9 NaN
11 R 21733 7 f NaN Correct
12 P 22740 RED 100 9 NaN
13 R 23139 7 y NaN False
14 P 24147 BLUE 100 9 NaN
15 R 24547 7 f NaN False
16 P 25555 RED 800 9 NaN
17 R 26043 7 b NaN Correct
18 P 27051 BLUE 800 9 NaN
19 \n\n NaN NaN NaN NaN NaN
```