Kris Harper - 1 month ago
Python Question

Can I apply a vectorized function to a pandas dataframe?

I am pretty new to pandas and numpy, and I'm trying to figure out the best way to do some things.

Right now I am trying to call a function on every row of a dataframe. If I pass three numpy arrays to this function, it's very fast, but using apply on the dataframe is very slow.

My guess is that numpy is using vectorized functions in the first case, and not in the second. Is there a way to get pandas to use that optimization? Basically, in pseudocode I think apply is doing something like

    for row in frame: func(row['a'], row['b'], row['c'])

but I want it to do

    func(col['a'], col['b'], col['c'])

Here is an example of what I am trying to do.

import numpy as np
import pandas as pd
from scipy.stats import beta

count = 100000

# If I start with a given dataframe and use apply, it's very slow

df = pd.DataFrame(np.random.uniform(0, 1, size=(count, 3)), columns=['a', 'b', 'c'])
df.apply(lambda row: beta.cdf(row['a'], row['b'], row['c']), axis=1)

# However, if I split out each column into a numpy array, this is very fast.

a = df['a'].to_numpy()
b = df['b'].to_numpy()
c = df['c'].to_numpy()

beta.cdf(a, b, c)

# But at this point I've lost the context of the dataframe.
# I would like to keep the results in a new column for further processing

Answer

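You don't need apply here. beta.cdf works on whole columns at once, so you can call beta.cdf(df.a, df.b, df.c) directly. To keep the result alongside the rest of the dataframe, assign it to a new column: df['cdf'] = beta.cdf(df.a, df.b, df.c).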
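
For illustration, here is a minimal sketch of the column-wise call using the same setup as the question. The column name 'cdf', the fixed random seed, and the 100-row spot check are assumptions added for this example. The explicit to_numpy() conversion is a defensive choice and may not be required.

```python
import numpy as np
import pandas as pd
from scipy.stats import beta

count = 100000

# Seeded generator so the example is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, size=(count, 3)), columns=['a', 'b', 'c'])

# One vectorized call over whole columns instead of one Python call per row.
# The result is stored in a new column ('cdf' is a hypothetical name), so the
# dataframe context is kept for further processing.
df['cdf'] = beta.cdf(df['a'].to_numpy(), df['b'].to_numpy(), df['c'].to_numpy())

# Spot-check a small sample against the slow row-wise apply.
sample = df.head(100)
row_wise = sample.apply(lambda row: beta.cdf(row['a'], row['b'], row['c']), axis=1)
matches = np.allclose(sample['cdf'], row_wise)
```

The row-wise apply is kept only as a check on a small sample. On the full dataframe it calls beta.cdf once per row from Python, which is where the slowdown in the question comes from.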
