Michael Perdue Michael Perdue - 7 months ago 12
Python Question

Pandas: read_csv indicating 'space-delimited'

I have the following file.txt (abridged):

SICcode Catcode Category SICname MultSIC
0111 A1500 Wheat, corn, soybeans and cash grain Wheat X
0112 A1600 Other commodities (incl rice, peanuts) Rice X
0115 A1500 Wheat, corn, soybeans and cash grain Corn X
0116 A1500 Wheat, corn, soybeans and cash grain Soybeans X
0119 A1500 Wheat, corn, soybeans and cash grain Cash grains, NEC X
0131 A1100 Cotton Cotton X
0132 A1300 Tobacco & Tobacco products Tobacco X


I'm having some problems reading it into a pandas df. I tried
pd.read_csv
with the following specifications
engine='python', sep='Tab'
but it returned the file in one column:

´╗┐SICcode Catcode Category SICname MultSIC
0 0111 A1500 Wheat, corn, soybeans...
1 0112 A1600 Other commodities (in...
2 0115 A1500 Wheat, corn, soybeans...
3 0116 A1500 Wheat, corn, soybeans...


Then I tried to put it into a gnumeric file using 'tab' as a delimiter, but it read the file as one column. Does anyone have an idea on this?

Answer

If df = pd.read_csv('file.txt', sep='\t') returns a DataFrame with one column, then apparently file.txt is not using tabs as separators. Your data might simply have spaces as separators. In that case you could try

df = pd.read_csv('data', sep=r'\s{2,}')

which uses the regex pattern \s{2,} as the separator. This regex matches 2-or-more whitespace characters.

In [8]: df
Out[8]: 
   SICcode Catcode                                Category           SICname  \
0      111   A1500    Wheat, corn, soybeans and cash grain             Wheat   
1      112   A1600  Other commodities (incl rice, peanuts)              Rice   
2      115   A1500    Wheat, corn, soybeans and cash grain              Corn   
3      116   A1500    Wheat, corn, soybeans and cash grain          Soybeans   
4      119   A1500    Wheat, corn, soybeans and cash grain  Cash grains, NEC   
5      131   A1100                                  Cotton            Cotton   
6      132   A1300              Tobacco & Tobacco products           Tobacco   

  MultSIC  
0       X  
1       X  
2       X  
3       X  
4       X  
5       X  
6       X  

If this does not work, please post print(repr(open(file.txt, 'rb').read(100)). This will show us an unambiguous representation of the first 100 bytes of file.txt.