linpingta linpingta - 1 year ago 130
Python Question

Pandas - Explanation on apply function being slow

Apply function seems to work very slow with a large dataframe (about 1~3 million rows).

I have checked related questions here, like Speed up Pandas apply function, and Counting within pandas apply() function, it seems the best way to speed it up is not to use apply function :)

For my case, I have two kinds of tasks to do with the apply function.

First: apply with lookup dict query

f(p_id, p_dict):
return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...) # it's another dict works like lookup table
df = df.apply(f, args=(p_dict,))

Second: apply with groupby

f(week_id, min_week_num, p_dict):
return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].ix[:,2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f, df['WEEK'])

I guess for the fist case, it could be done with dataframe join, while I am not sure about resource cost for such join on a large dataset.

My question is:

  1. Is there any way to substitute apply in the two above cases?

  2. Why is apply so slow? For the dict lookup case, I think it should be O(N), it shouldn't cost that much even if N is 1 million.

Answer Source

Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame, you have considerable IO overhead (I cannot guarantee this is the case 100% of the time since Pandas has loads of internal implementation optimization).

For your first method, I assume you are trying to fill a 'value' column in df using the p_dict as a lookup table. It is about 1000x faster to use pd.merge:

import string, sys

import numpy as np
import pandas as pd

# Part 1 - filling a column by a lookup table
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]

# Testing
n_size = 1000
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})

# Apply the f1 method  as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop

# Using merge
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)

>>> 1000 loops, best of 5: 826 ┬Ás per loop