David David - 1 year ago 91
Python Question

Parsing CSV file in pandas with commas

I need to create a

pandas.DataFrame
from a
csv
file. For that I am using the method
pandas.csv_reader(...)
. The problem with this file is that one or more columns contain commas within the values (I don't control the file format).
I been trying to implement the solution from this question but I get the following error:

pandas.errors.EmptyDataError: No columns to parse from file


For some reason after implementing this solution the csv file I tried fixing is blank.

Here is the code I am using:

# fix csv file
with open ("/Users/username/works/test.csv",'rb') as f,\
open("/Users/username/works/test.csv",'wb') as g:
writer = csv.writer(g, delimiter=',')
for line in f:
row = line.split(',', 4)
writer.writerow(row)
# Manipulate csv file
data = pd.read_csv(os.path.expanduser\
("/Users/username/works/test.csv"),error_bad_lines=False)


Any ideas?

Data overview:

Id0 Id 1 Id 2 Country Company Title Email
23 123 456 AR name cargador [email protected]

24 123 456 AR name Executive assistant [email protected]

25 123 456 AR name Asistente Administrativo [email protected]

26 123 456 AR name Atención al cliente vía telefónica vía online [email protected]
39 123 456 AR name Asesor de ventas [email protected]

40 123 456 AR name inc. International company representative [email protected]
41 123 456 AR name Vendedor de campo [email protected]

42 123 456 AR name PUBLICIDAD ATENCIÓN AL CLIENTE [email protected]
43 123 456 AR name Asistente de Marketing [email protected]

44 123 456 AR name SOLDADOR [email protected]
217 123 456 AR name Se requiere vendedores Loja Quevedo Guayas) [email protected]
218 123 456 AR name Ing. Civil recién graduado Yaruquí [email protected]
219 123 456 AR name ayudantes enfermeria [email protected]

220 123 456 AR name Trip Leader for International Youth Exchange [email protected]
221 123 456 AR name COUNTRY MANAGER / DIRECTOR COMERCIAL [email protected]
250 123 456 AR name Ayudante de Pasteleria [email protected] Asesor [email protected] [email protected]


Pre-parsed CSV:

#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,[email protected],,,,
24,123,456,AR,name,Executive assistant,[email protected],,,,
25,123,456,AR,name,Asistente Administrativo,[email protected],,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,[email protected],,,
39,123,456,AR,name,Asesor de ventas,[email protected],,,,
40,123,456,AR,name, inc.,International company representative,[email protected],,,
41,123,456,AR,name,Vendedor de campo,[email protected],,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,[email protected],,,
43,123,456,AR,name,Asistente de Marketing,[email protected],,,,
44,123,456,AR,name,SOLDADOR,[email protected],,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),[email protected]
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,[email protected],,,
219,123,456,AR,name,ayudantes enfermeria,[email protected],,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,[email protected],,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,[email protected],,,,
250,123,456,AR,name,Ayudante de Pasteleria,[email protected], Asesor,[email protected],[email protected],
251,123,456,AR,name,Ejecutiva de Ventas,[email protected],,,,

Answer Source

If you can assume that all of the errant commas are in the column prior to the email address, then a small parser can be written to process that.

Code:

import csv
import re

VALID_EMAIL = re.compile(r'[^@][email protected][^@]+\.[^@]+')

def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail column
    header = next(reader)
    email_column = header.index('Email')

    # yield the header up to the e-mail column
    yield header[:email_column+1]

    # for each row, go through and put the Title column back together
    for row in reader:
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column-1] += ',' + row[email_column]
            del row[email_column]
        yield row[:email_column+1]

Test Code:

with open ("test.csv", 'rU') as f:
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

Results:

      # Id 1 Id 2 Country Company  \
0    23  123  456      AR    name   
1    24  123  456      AR    name   
2    25  123  456      AR    name   
3    26  123  456      AR    name   
4    39  123  456      AR    name   
5    40  123  456      AR    name   
6    41  123  456      AR    name   
7    42  123  456      AR    name   
8    43  123  456      AR    name   
9    44  123  456      AR    name   
10  217  123  456      AR    name   
11  218  123  456      AR    name   
12  219  123  456      AR    name   
13  220  123  456      AR    name   
14  221  123  456      AR    name   
15  250  123  456      AR    name   
16  251  123  456      AR    name   

                                               Title            Email  
0                                           cargador  [email protected]  
1                                Executive assistant  [email protected]  
2                           Asistente Administrativo  [email protected]  
3    Atención al cliente vía telefónica , vía online  [email protected]  
4                                   Asesor de ventas  [email protected]  
5          inc.,International company representative  [email protected]  
6                                  Vendedor de campo  [email protected]  
7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  [email protected]  
8                             Asistente de Marketing  [email protected]  
9                                           SOLDADOR  [email protected]  
10  Se requiere vendedores,, Loja , Quevedo, Guayas)  [email protected]  
11               Ing. Civil recién graduado, Yaruquí  [email protected]  
12                              ayudantes enfermeria  [email protected]  
13      Trip Leader for International Youth Exchange  [email protected]  
14              COUNTRY MANAGER / DIRECTOR COMERCIAL  [email protected]  
15                            Ayudante de Pasteleria  [email protected]  
16                               Ejecutiva de Ventas  [email protected]  
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download