ppaulojr ppaulojr - 1 month ago 11
Python Question

Pandas: slow date conversion

I'm reading a huge

CSV
with a date field in the format
YYYYMMDD
and I'm using the following lambda to convert it when reading:

import pandas as pd

df = pd.read_csv(filen,
index_col=None,
header=None,
parse_dates=[0],
date_parser=lambda t:pd.to_datetime(str(t),
format='%Y%m%d', coerce=True))


This function is very slow though.

Any suggestion to improve it?

Answer

Try using this function for parsing dates:

def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date:pd.to_datetime(date) for date in s.unique()}
    return s.map(dates)

Use it like:

df['date-column'] = lookup(df['date-column'])

Benchmarks:

$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms

Source: https://github.com/sanand0/benchmarks/tree/master/date-parse

Comments