JohnE - 2 months ago
Python Question

Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

Edit to add: I don't think the numba benchmarks are fair, notes below

I'm trying to benchmark different approaches to numerically processing data for the following use case:

  1. Fairly big dataset (100,000+ records)

  2. 100+ lines of fairly simple code (z = x + y)

  3. Don't need to sort or index

In other words, the full generality of series and dataframes is not needed, although they are included here b/c they are still convenient ways to encapsulate the data and there is often pre- or post-processing that does require the generality of pandas over numpy arrays.

Question: Based on this use case, are the following benchmarks appropriate and if not, how can I improve them?

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numba import jit

nobs = 10000
nlines = 100

def proc_df():
    df = DataFrame({'x': np.random.randn(nobs),
                    'y': np.random.randn(nobs)})
    for i in range(nlines):
        df['z'] = df.x + df.y
    return df.z

def proc_ser():
    x = Series(np.random.randn(nobs))
    y = Series(np.random.randn(nobs))
    for i in range(nlines):
        z = x + y
    return z

def proc_arr():
    x = np.random.randn(nobs)
    y = np.random.randn(nobs)
    for i in range(nlines):
        z = x + y
    return z

@jit
def proc_numba():
    xx = np.random.randn(nobs)
    yy = np.random.randn(nobs)
    zz = np.zeros(nobs)
    for j in range(nobs):
        x, y = xx[j], yy[j]
        for i in range(nlines):
            z = x + y
        zz[j] = z
    return zz

Results (Win 7, 3-year-old quad-core Xeon workstation; standard, recent Anaconda distribution or very close to it):

In [1251]: %timeit proc_df()
10 loops, best of 3: 46.6 ms per loop

In [1252]: %timeit proc_ser()
100 loops, best of 3: 15.8 ms per loop

In [1253]: %timeit proc_arr()
100 loops, best of 3: 2.02 ms per loop

In [1254]: %timeit proc_numba()
1000 loops, best of 3: 1.04 ms per loop # may not be valid result (see note below)

Edit to add (in response to jeff): alternate results from passing the df/series/array into the functions rather than creating them inside (i.e. moving the lines containing 'randn' from inside each function to outside it), listed in the same order as above:

10 loops, best of 3: 45.1 ms per loop
100 loops, best of 3: 15.1 ms per loop
1000 loops, best of 3: 1.07 ms per loop
100000 loops, best of 3: 17.9 µs per loop # may not be valid result (see note below)

Note on numba results: I think the numba compiler must be optimizing the inner for loop and reducing it to a single iteration. I don't know that for sure, but it's the only explanation I can come up with, as it couldn't be 50x faster than numpy, right? Followup question here: Why is numba faster than numpy here?
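One way to probe that suspicion is to make every pass of the inner loop depend on the previous pass, so the loop body is no longer invariant and cannot be collapsed into a single addition without changing the result. The sketch below is an illustration under assumptions, not the original benchmark: `proc_numba_dependent` and the `0.5 * z` update are hypothetical choices, and the fallback to plain Python when numba is missing is only there so the snippet runs anywhere.

```python
import numpy as np

try:
    from numba import njit  # compile to machine code when numba is installed
except ImportError:         # assumption: fall back to plain Python otherwise
    def njit(func):
        return func

nobs = 10000
nlines = 100

@njit
def proc_numba_dependent(xx, yy, nlines):
    # Hypothetical variant of proc_numba: z feeds back into itself, so each
    # of the nlines passes needs the result of the previous one.
    zz = np.zeros(xx.shape[0])
    for j in range(xx.shape[0]):
        x, y = xx[j], yy[j]
        z = 0.0
        for i in range(nlines):
            z = 0.5 * z + x + y
        zz[j] = z
    return zz

xx = np.random.randn(nobs)
yy = np.random.randn(nobs)
zz = proc_numba_dependent(xx, yy, nlines)
```

If the timing of this variant grows with `nlines` while the timing of the original does not, that would be consistent with the compiler having removed the redundant iterations in the original.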


Well, you are not really timing the same things here (or rather, you are timing different aspects).


In [6]: x = Series(np.random.randn(nobs))

In [7]: y = Series(np.random.randn(nobs))

In [8]: %timeit x + y
10000 loops, best of 3: 131 µs per loop

In [9]: %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop

So [8] times only the actual operation, while [9] includes the overhead of Series creation (and of random number generation) plus the actual operation.

Another example is proc_ser vs. proc_df. proc_df includes the overhead of assigning a particular column in the DataFrame (which actually differs between initial creation and subsequent reassignment).

So create the structures first (you can time that too, but that is a separate issue), then perform the exact same operation on each and time only that.
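As a rough sketch of that advice, the snippet below builds the array, Series and DataFrame once from the same data and then times only the addition. The helper names (`add_arr`, `add_ser`, `add_df`, `best_time`) are hypothetical, and the standard-library `timeit` module stands in for IPython's `%timeit`, so the numbers will not match the ones quoted above exactly.

```python
import timeit

import numpy as np
import pandas as pd

nobs = 10000

# Build every structure once, outside the timed code, from the same data.
x_arr = np.random.randn(nobs)
y_arr = np.random.randn(nobs)
x_ser, y_ser = pd.Series(x_arr), pd.Series(y_arr)
df = pd.DataFrame({"x": x_arr, "y": y_arr})

def add_arr():
    return x_arr + y_arr

def add_ser():
    return x_ser + y_ser

def add_df():
    return df["x"] + df["y"]  # no column assignment, just the addition

def best_time(func, number=100, repeat=3):
    # Best-of-`repeat` seconds per call, similar in spirit to %timeit.
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

timings = {f.__name__: best_time(f) for f in (add_arr, add_ser, add_df)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Timing the column assignment `df['z'] = ...` or the construction of the structures can be done the same way, as separate measurements.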

Further, you say that you don't need alignment. Pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check if the operands are already aligned), whereas in numba you need to align them 'manually'.
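To illustrate what alignment means here, a small sketch with made-up data (the labels and values are hypothetical): adding two Series matches elements by index label, while adding their underlying arrays matches them by position only.

```python
import numpy as np
import pandas as pd

# Same labels, stored in a different order.
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

# Series addition aligns on labels: 'a' pairs with 'a', 'c' with 'c'.
aligned = x + y

# Plain arrays know nothing about labels: element 0 pairs with element 0.
positional = x.to_numpy() + y.to_numpy()

print(aligned["a"], positional[0])
```

Here `aligned["a"]` is 1.0 + 30.0, whereas `positional[0]` is 1.0 + 10.0, which is the kind of mismatch that has to be handled by hand once the data leaves pandas.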