user5779223 - 5 months ago
Python Question

How to read a text file with an uneven number of columns with python-pandas?

Given a file with the following format:

really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative


The last column is the polarity label, which is either negative or positive. The columns before it are the bag-of-words representation of the corresponding paragraph. How can I read the file into a data frame with two columns, where the first holds the bag-of-words string and the second holds the label? Thank you in advance!

Answer

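You only need read_csv, with the string " label:" as the separator. A separator longer than one character is interpreted as a regular expression, which is why engine='python' is set: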

import pandas as pd
import io

temp=u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), 
                 sep=r" label:",
                 header=None, 
                 names=['bag','label'], 
                 engine='python')
print (df)
                                                 bag     label
0       really:1 christensen:1 scariest:1 many_of:1   positive
1  varied_experiences:1 experiences_from:1 island...  positive
2                              scariest:1 many_of:1   negative
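If you would rather not depend on the parser treating the separator as a regular expression, the same split can be sketched in plain Python with str.rpartition. This is only an alternative sketch: it assumes every line contains " label:" right before the label, and the variable names (rows, bag, label) are just illustrative.

```python
import io
import pandas as pd

temp = u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""

rows = []
# after testing, replace io.StringIO(temp) with open(filename)
for line in io.StringIO(temp):
    line = line.rstrip("\n")
    if not line.strip():
        continue  # skip blank lines
    # rpartition splits on the LAST occurrence of " label:";
    # if the separator is missing, the whole line ends up in `label`
    bag, _, label = line.rpartition(" label:")
    rows.append((bag, label))

df = pd.DataFrame(rows, columns=["bag", "label"])
print(df)
```

Since this bypasses read_csv entirely, it may be slower on very large files, but it makes the splitting rule explicit.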

A more general solution, which reads each line as a single column and then splits it on the last whitespace with str.rsplit:

import pandas as pd
import io

temp=u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), 
                 sep=";",  # any character that does NOT occur anywhere in the text
                 header=None, 
                 names=['text'])
print (df)
                                                text
0  really:1 christensen:1 scariest:1 many_of:1 la...
1  varied_experiences:1 experiences_from:1 island...
2                scariest:1 many_of:1 label:negative

df[['bag','label']] = df.text.str.rsplit(expand=True, n=1)
df = df.drop('text', axis=1)
print (df)
                                                 bag           label
0        really:1 christensen:1 scariest:1 many_of:1  label:positive
1  varied_experiences:1 experiences_from:1 island...  label:positive
2                               scariest:1 many_of:1  label:negative
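Note that with rsplit the label column still carries the "label:" prefix, unlike the first solution. If you want the bare values, one possible follow-up step is to strip the prefix with str.replace. The small data frame below is only a stand-in for the rsplit result above:

```python
import pandas as pd

# stand-in for the data frame produced by the rsplit step
df = pd.DataFrame({
    "bag": ["really:1 christensen:1 scariest:1 many_of:1",
            "scariest:1 many_of:1"],
    "label": ["label:positive", "label:negative"],
})

# remove the leading "label:" only; ^ anchors the match at the start of the string
df["label"] = df["label"].str.replace(r"^label:", "", regex=True)
print(df["label"].tolist())
```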