Saranya Krishnamurthy Saranya Krishnamurthy - 2 months ago 30
Python Question

pd.read_csv not correctly parsing date/month field when set parse_date = ['column name']

I ran in to this bug while trying to parse the few dates through parse_dates of

pandas.read_csv()
. In the following code snippet, I'm trying to parse dates that have format
dd/mm/yy
which is resulting me an improper conversion. For some cases, the date field is considered as month and vice versa.

To keep it simple, for some cases
dd/mm/yy
get converted to
yyyy-dd-mm
instead of
yyyy-mm-dd
.

Case 1:

04/10/96 is parsed as 1996-04-10, which is wrong.


Case 2:

15/07/97 is parsed as 1997-07-15, which is correct.


Case 3:

10/12/97 is parsed as 1997-10-12, which is wrong.


Code Sample

import pandas as pd

df = pd.read_csv('date_time.csv')
print 'Data in csv:'
print df
print df['start_date'].dtypes

print '----------------------------------------------'

df = pd.read_csv('date_time.csv', parse_dates = ['start_date'])
print 'Data after parsing:'
print df
print df['start_date'].dtypes


Current Output

----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-04-10
1 1997-07-15
2 1997-10-12
3 1999-06-03
4 1994-01-01
5 1967-02-01
datetime64[ns]


Expected Output

----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date

0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
datetime64[ns]


More Comments:

I could use
date_parser
or
pandas.to_datetime()
to specify the proper format for date. But in my case, I have few date fields like
['//1997', '/02/1967']
for which I need to convert
['01/01/1997','01/02/1967']
. The
parse_dates
helps me in converting those type of date fields to the expected format without making me to write extra line of code.

Is there any solution for this?

Bug Link @GitHub: https://github.com/pydata/pandas/issues/13063

Answer

In version pandas 0.18.0 you can add parameter dayfirst=True and then it works:

import pandas as pd
import io

temp=u"""start_date
04/10/96
15/07/97
10/12/97
06/03/99
//1994
/02/1967
"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp),  parse_dates = ['start_date'], dayfirst=True)
  start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01

Another solution:

You can parsing with to_datetime with different parameters format and errors='coerce' and then combine_first:

date1 = pd.to_datetime(df['start_date'], format='%d/%m/%y', errors='coerce')
print date1
0   1996-10-04
1   1997-07-15
2   1997-12-10
3   1999-03-06
4          NaT
5          NaT
Name: start_date, dtype: datetime64[ns]

date2 = pd.to_datetime(df['start_date'], format='/%m/%Y', errors='coerce')
print date2
0          NaT
1          NaT
2          NaT
3          NaT
4          NaT
5   1967-02-01
Name: start_date, dtype: datetime64[ns]

date3 = pd.to_datetime(df['start_date'], format='//%Y', errors='coerce')
print date3
0          NaT
1          NaT
2          NaT
3          NaT
4   1994-01-01
5          NaT
Name: start_date, dtype: datetime64[ns]
print date1.combine_first(date2).combine_first(date3)
0   1996-10-04
1   1997-07-15
2   1997-12-10
3   1999-03-06
4   1994-01-01
5   1967-02-01
Name: start_date, dtype: datetime64[ns]