ShanZhengYang - 2 months ago
Python Question

Cannot parse the following text file into a pandas dataframe?

I have the following text file, file1.txt, in this format (showing it exactly as I see it):

3612 11.4 21.5 1.3 cat3 10469 11447 9239174 - Smith David
484 25.1 13.2 0.0 cat3 11505 11675 9238946 - John Mary
239 29.4 1.9 1.0 cat3 11678 11780 9238841 + Weiz Parker
318 23.0 3.7 0.0 cat3 15265 15355 9235266 + Cohen Charles
18 23.2 0.0 2.0 cat3 15798 15849 9234772 + Lopez Beth
463 1.3 0.6 1.7 cat3 10001 10468 9240153 + Brown Charlie


I wanted to immediately load this into a Pandas DataFrame with

import pandas as pd
df = pd.DataFrame("file1.txt")


But this gives me a dataframe with only one column.

So, I tried to parse this file into a .csv with Python. The problem is that the file doesn't use a "constant" delimiter, i.e. it's not tab-separated.

import csv
input_text = csv.reader(open("file1.txt", "r"), delimiter = "\t")
output_csv = csv.writer(open("file1.csv", 'w'))
output_csv.writerows(input_text) # this should write a csv "file1.csv"


However, this gives the same result. Setting delimiter = "" doesn't work either.

How can I parse this text file into csv format? Can I do this with Python, or do I need awk? Should I skip the intermediate csv step and do this entirely in pandas?

Any help appreciated!

Answer

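Use pd.read_csv() with a regex separator that matches any run of whitespace (sep=r"\s+"), pass header=None because the file has no header row, and supply the column names with names=. Note that each row has 11 fields but only 10 names (A through J) are given here, so pandas uses the first field as the index.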

In [24]: pd.read_csv("file1.txt", header=None, names=[chr(i) for i in range(65, 75)], sep=r"\s+")
Out[24]: 
         A     B    C     D      E      F        G  H      I        J
3612  11.4  21.5  1.3  cat3  10469  11447  9239174  -  Smith    David
484   25.1  13.2  0.0  cat3  11505  11675  9238946  -   John     Mary
239   29.4   1.9  1.0  cat3  11678  11780  9238841  +   Weiz   Parker
318   23.0   3.7  0.0  cat3  15265  15355  9235266  +  Cohen  Charles
18    23.2   0.0  2.0  cat3  15798  15849  9234772  +  Lopez     Beth
463    1.3   0.6  1.7  cat3  10001  10468  9240153  +  Brown  Charlie
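If you would rather keep the first field as an ordinary column and also want the csv file from the question, here is a minimal sketch. It assumes one name per field (11 in total). The names col1 to col11 are placeholders I made up, and io.StringIO stands in for file1.txt so the snippet runs on its own.

```python
import io

import pandas as pd

# Rows copied from the question; StringIO stands in for open("file1.txt").
sample = """\
3612 11.4 21.5 1.3 cat3 10469 11447 9239174 - Smith David
484 25.1 13.2 0.0 cat3 11505 11675 9238946 - John Mary
239 29.4 1.9 1.0 cat3 11678 11780 9238841 + Weiz Parker
318 23.0 3.7 0.0 cat3 15265 15355 9235266 + Cohen Charles
18 23.2 0.0 2.0 cat3 15798 15849 9234772 + Lopez Beth
463 1.3 0.6 1.7 cat3 10001 10468 9240153 + Brown Charlie
"""

# Placeholder names (an assumption, not from the question): one per field.
columns = [f"col{i}" for i in range(1, 12)]

# sep=r"\s+" splits on any run of whitespace; header=None means no header row.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=columns)

# Serialize as comma-separated text; pass a path such as "file1.csv"
# as the first argument to write a file instead of returning a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With a real file, replace io.StringIO(sample) with "file1.txt". There is no need for the csv module or awk as an intermediate step.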