whytheq - 1 year ago
Python Question

# Is this approach "vectorized"? Used against a medium-sized dataset it is relatively slow

I have this data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(9),
                   'b': ['foo', 'bar', 'blah'] * 3,
                   'c': np.random.randn(9)})
```

This function:

```python
def my_test2(row, x):
    if x == 'foo':
        blah = 10
    if x == 'bar':
        blah = 20
    if x == 'blah':
        blah = 30
    return (row['a'] % row['c']) + blah
```

I am then creating 3 new columns like this:

```python
df['Value_foo'] = df.apply(my_test2, axis=1, x='foo')
df['Value_bar'] = df.apply(my_test2, axis=1, x='bar')
df['Value_blah'] = df.apply(my_test2, axis=1, x='blah')
```

It runs fine, but when I make `my_test2` more complex and expand `df` to several thousand rows it gets slow. Is the above what I hear described as "vectorized"? Can I easily speed things up?

As Andrew, Ami Tavory and Sohier Dane have already mentioned in the comments, there are two "slow" things in your solution:

1. `.apply()` is generally slow, as it loops under the hood.
2. `.apply(..., axis=1)` is extremely slow (even compared to `.apply(..., axis=0)`), as it loops on a per-row basis.
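You can see the difference in isolation by comparing a row-wise `.apply()` against the equivalent vectorized expression on a toy frame (a minimal sketch; the column names and constant are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with two numeric columns
df = pd.DataFrame({'a': np.random.randn(1000),
                   'c': np.random.randn(1000)})

# Row-wise apply: a Python-level loop over rows (slow)
slow = df.apply(lambda row: row['a'] % row['c'] + 10, axis=1)

# Vectorized: one call into NumPy for the whole column (fast)
fast = df['a'] % df['c'] + 10

# Both paths compute the same values
assert np.allclose(slow, fast)
```

Both expressions produce identical results; only the vectorized one avoids the per-row Python overhead.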

Here is a vectorized approach:

```python
In [74]: d = {
   ....:   'foo': 10,
   ....:   'bar': 20,
   ....:   'blah': 30
   ....: }

In [75]: d
Out[75]: {'bar': 20, 'blah': 30, 'foo': 10}

In [76]: for k, v in d.items():
   ....:     df['Value_{}'.format(k)] = df.a % df.c + v
   ....:

In [77]: df
Out[77]:
          a     b         c  Value_bar  Value_blah  Value_foo
0 -0.747164   foo  0.438713  20.130262   30.130262  10.130262
1 -0.185182   bar  0.047253  20.003828   30.003828  10.003828
2  1.622818  blah -0.730215  19.432174   29.432174    9.432174
3  0.117658   foo  1.530249  20.117658   30.117658   10.117658
4  2.536363   bar -0.100726  19.917499   29.917499    9.917499
5  1.128002  blah  0.350663  20.076014   30.076014   10.076014
6  0.059516   foo  0.638910  20.059516   30.059516   10.059516
7 -1.184688   bar  0.073781  20.069590   30.069590   10.069590
8  1.440576  blah -2.231575  19.209001   29.209001    9.209001
```

Timing against a 90K-row DataFrame:

```python
In [80]: big = pd.concat([df] * 10**4, ignore_index=True)

In [81]: big.shape
Out[81]: (90000, 3)

In [82]: %%timeit
   ....: big['Value_foo'] = big.apply(my_test2, axis=1, x='foo')
   ....: big['Value_bar'] = big.apply(my_test2, axis=1, x='bar')
   ....: big['Value_blah'] = big.apply(my_test2, axis=1, x='blah')
   ....:
1 loop, best of 3: 10.5 s per loop

In [83]: big = pd.concat([df] * 10**4, ignore_index=True)

In [84]: big.shape
Out[84]: (90000, 3)

In [85]: %%timeit
   ....: for k, v in d.items():
   ....:     big['Value_{}'.format(k)] = big.a % big.c + v
   ....:
100 loops, best of 3: 7.24 ms per loop
```

Conclusion: the vectorized approach is roughly 1,450 times faster (10.5 s vs. 7.24 ms).
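As a side note: if instead of three columns you wanted a single column whose constant depends on each row's own `b` value, you can stay vectorized with a `Series.map` lookup rather than `.apply()`. A sketch, reusing the `d` dict from above (the `Value` column name is illustrative):

```python
import numpy as np
import pandas as pd

d = {'foo': 10, 'bar': 20, 'blah': 30}

df = pd.DataFrame({'a': np.random.randn(9),
                   'b': ['foo', 'bar', 'blah'] * 3,
                   'c': np.random.randn(9)})

# Map each row's 'b' label to its constant, then do vectorized arithmetic
df['Value'] = df['a'] % df['c'] + df['b'].map(d)
```

`Series.map` does the per-row dict lookup in one pass, so the whole expression remains column-level arithmetic.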
