Nickj Nickj - 3 years ago 246
Python Question

Read CSV file with space as thousand-seperator using pandas.read_csv

I have a (French) dataset that looks like the following:

time;col1;col2;col3
06.09.2017 05:30;329,02;5,7;259
06.09.2017 05:40;500,5;6,6;261
06.09.2017 05:50;521,73;6,7;266
06.09.2017 06:00;1 091,33;9,1;273
06.09.2017 06:10;1 262,43;10;285


I try to read it using the following command:

import pandas as pd
df=pd.read_csv("Example_dataset.csv",
index_col=0,
encoding='latin',
parse_dates=True,
dayfirst=True,
sep=';',
decimal=',',
thousands=' ')


col2 and col3 are recognized as float and integer, though col1 is not recognized as a number because of the thousands-seperator in there. Is there an easy way to read this dataset? Setting
thousands=' '
does not seem to work:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2017-09-06 05:30:00 to 2017-09-06 06:10:00
Data columns (total 3 columns):
col1 5 non-null object
col2 5 non-null float64
col3 5 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 160.0+ bytes


Any suggestions?

Answer Source

If you have non-breaking spaces, I would suggest a more aggressive regular expression with str.replace:

df.col1 = df.col1.str.replace('[^\d.,e+-]', '')\
               .str.replace(',', '.').astype(float)

Regex

[       # character group
^       # negation - ignore everything in this character group
\d      # digit
.       # dot
e       # 'e' - exponent
+-      # signs 
]      
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download