manisha manisha - 6 months ago 11
Python Question

Converting space-aligned text file into Pandas DataFrame

I am quite new to pandas. I have a log text file. I am trying to grab few data point from the file. Below is the code that kind of gets me the desired data but not in desired format. I wanted Pandas data frame with two columns.

import os
from collections import Counter
import pandas as pd
#print(os.getcwd())
infile = "myfile.txt"

important = []
keep_phrases = ["Host",
"User-Agent"
]

with open(infile) as f:
f = f.readlines()

for line in f:
for phrase in keep_phrases:
if phrase in line:
important.append(line)

break
#print(type(important))
print(important)
#Counter(important)
pd.DataFrame(important)


This does not give me output in two column. I am looking for host and user agent as one row.

Sample of the text file as below

15 SessionOpen c aa.bb.cc.ddd 62667 :8080
15 SessionClose c pipe
15 ReqStart c aa.bb.cc.ddd 62667 442374415
15 RxURL c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag666
15 RxHeader c Host: ll.abrstream.channel4.com
15 RxHeader c Connection: keep-alive
15 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
15 RxHeader c X-Requested-With: ShockwaveFlash/21.0.0.216
15 RxHeader c Accept: */*
15 RxHeader c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
15 RxHeader c Accept-Encoding: gzip, deflate, sdch
15 RxHeader c Accept-Language: en-US,en;q=0.8
15 ReqEnd c 442374415 1461870946.496117592 1461870947.112555504 0.000315428 0.001363039 0.615074873
15 SessionOpen c aa1.bb1.cc1.ddd1 59409 :8080
15 SessionClose c pipe
15 ReqStart c aa1.bb1.cc1.ddd1 59409 442374416
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212
15 SessionOpen c aa1.bb1.cc1.ddd1 52179 :8080
15 SessionClose c pipe
15 ReqStart c aa1.bb1.cc1.ddd1 52179 442374417
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374417 1461870951.776547432 1461870952.448071241 0.000062943 0.001109123 0.670414686
18 SessionOpen c aa.bb.cc.ddd 62670 :8080
18 SessionClose c pipe
18 ReqStart c aa.bb.cc.ddd 62670 442374418
18 RxURL c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag667
18 RxHeader c Host: ll.abrstream.channel4.com
18 RxHeader c Connection: keep-alive
18 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
18 RxHeader c X-Requested-With: ShockwaveFlash/21.0.0.216
18 RxHeader c Accept: */*
18 RxHeader c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
18 RxHeader c Accept-Encoding: gzip, deflate, sdch
18 RxHeader c Accept-Language: en-US,en;q=0.8
18 ReqEnd c 442374418 1461870951.920178175 1461870952.507097483 0.001731873 0.001337051 0.585582256
15 SessionOpen c aa1.bb1.cc1.ddd1 48034 :8080
15 SessionClose c pipe

Answer

You can create a dataframe by creating a list of lists, and then use the dataframe constructor.

Loop through each line of the file, like you've started doing, then split each line into the different columns. You can use re.split to create a list of the columns, limitting the maximum number of splits to treat the last column as one element. Alternatively, if you know each element is always going to be aligned in the same way, you can use slicing to create that list.

import re

df_list = []
with open(infile) as f:
    for line in f:
        # remove whitespace at the start and the newline at the end
        line = line.strip()
        # split each column on whitespace
        columns = re.split('\s+', line, maxsplit=4)
        df_list.append(columns)

You can then use the method in this answer to create the dataframe.

df = pd.DataFrame(df_list)
Comments