JohnE - 9 months ago 54

Python Question

**Edit to add**: I don't think the numba benchmarks are fair, notes below

I'm trying to benchmark different approaches to numerically processing data for the following use case:

- Fairly big dataset (100,000+ records)
- 100+ lines of fairly simple code (z = x + y)
- Don't need to sort or index

In other words, the full generality of series and dataframes is not needed, although they are included here b/c they are still convenient ways to encapsulate the data and there is often pre- or post-processing that does require the generality of pandas over numpy arrays.

```
# importing pandas, numpy, Series, DataFrame in standard way
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numba import jit

nobs = 10000
nlines = 100

def proc_df():
    df = DataFrame({'x': np.random.randn(nobs),
                    'y': np.random.randn(nobs)})
    for i in range(nlines):
        df['z'] = df.x + df.y
    return df.z

def proc_ser():
    x = Series(np.random.randn(nobs))
    y = Series(np.random.randn(nobs))
    for i in range(nlines):
        z = x + y
    return z

def proc_arr():
    x = np.random.randn(nobs)
    y = np.random.randn(nobs)
    for i in range(nlines):
        z = x + y
    return z

@jit
def proc_numba():
    xx = np.random.randn(nobs)
    yy = np.random.randn(nobs)
    zz = np.zeros(nobs)
    for j in range(nobs):
        x, y = xx[j], yy[j]
        for i in range(nlines):
            z = x + y
        zz[j] = z
    return zz
```

Results (Win 7, 3 year old Xeon workstation (quad-core). Standard and recent anaconda distribution or very close.)

```
In [1251]: %timeit proc_df()
10 loops, best of 3: 46.6 ms per loop

In [1252]: %timeit proc_ser()
100 loops, best of 3: 15.8 ms per loop

In [1253]: %timeit proc_arr()
100 loops, best of 3: 2.02 ms per loop

In [1254]: %timeit proc_numba()
1000 loops, best of 3: 1.04 ms per loop # may not be valid result (see note below)
```

```
10 loops, best of 3: 45.1 ms per loop
100 loops, best of 3: 15.1 ms per loop
1000 loops, best of 3: 1.07 ms per loop
100000 loops, best of 3: 17.9 µs per loop # may not be valid result (see note below)
```

Answer

Well, you are not really timing the same things here (or rather, you are timing different aspects).

E.g.

```
In [6]: x = Series(np.random.randn(nobs))
In [7]: y = Series(np.random.randn(nobs))
In [8]: %timeit x + y
10000 loops, best of 3: 131 µs per loop
In [9]: %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop
```

So `[8]` times only the actual operation, while `[9]` includes the overhead of creating the Series (and generating the random numbers) plus the actual operation.

Another example is `proc_ser` vs `proc_df`. `proc_df` includes the overhead of assigning a particular column in the DataFrame (which is actually different for an initial creation and a subsequent reassignment).
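To see how much of the `proc_df` time comes from the column assignment rather than the addition, the two can be timed separately. The following is only a rough sketch: the helper names `add_only` and `add_and_assign` are made up for illustration, the data is built once outside the timed code, and the numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

nobs = 10_000  # same size as in the question

# Build the data once so that creation cost is not part of the timing.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.standard_normal(nobs),
                   "y": rng.standard_normal(nobs)})
x, y = df["x"], df["y"]


def add_only():
    # Hypothetical helper: only the Series addition.
    return x + y


def add_and_assign():
    # Hypothetical helper: the same addition, plus storing the
    # result as a DataFrame column (created on the first call,
    # reassigned on every later call).
    df["z"] = x + y


n = 200
t_add = min(timeit.repeat(add_only, number=n, repeat=3)) / n
t_assign = min(timeit.repeat(add_and_assign, number=n, repeat=3)) / n

print(f"x + y           : {t_add * 1e6:8.1f} us per call")
print(f"df['z'] = x + y : {t_assign * 1e6:8.1f} us per call")
```

The difference between the two figures is a rough estimate of the assignment overhead on its own.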

So create the structures first (you can time that too, but that is a separate issue), then perform the exact same operation on each and time that.
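One way to follow this advice is to build the data once and pass it into the functions, so that only the repeated addition is timed. This is a sketch rather than a definitive benchmark: the functions reuse the names from the question but take their inputs as arguments, `best_ms` is a hypothetical helper, and numba is left out to keep the example dependency-light.

```python
import timeit

import numpy as np
import pandas as pd

nobs = 10_000
nlines = 100

# Build every structure once, outside the timed functions.
rng = np.random.default_rng(0)
x_arr = rng.standard_normal(nobs)
y_arr = rng.standard_normal(nobs)
x_ser, y_ser = pd.Series(x_arr), pd.Series(y_arr)
df = pd.DataFrame({"x": x_arr, "y": y_arr})


def proc_df(df):
    for _ in range(nlines):
        df["z"] = df["x"] + df["y"]
    return df["z"]


def proc_ser(x, y):
    for _ in range(nlines):
        z = x + y
    return z


def proc_arr(x, y):
    for _ in range(nlines):
        z = x + y
    return z


def best_ms(func, *args, number=5, repeat=3):
    """Best per-call time in milliseconds (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e3


timings = {
    "proc_df": best_ms(proc_df, df),
    "proc_ser": best_ms(proc_ser, x_ser, y_ser),
    "proc_arr": best_ms(proc_arr, x_arr, y_arr),
}
for name, ms in timings.items():
    print(f"{name:8s} {ms:8.3f} ms per call")
```

With the setup moved out, any remaining gap between the three reflects the per-operation overhead of each container rather than construction or random number generation.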

Further, you say that you don't need alignment. Pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check if they are already aligned), while in numba you need to align them 'manually'.
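A small illustration of what alignment means in practice, using made-up values: two Series carry the same index labels in a different order, so adding them by label gives a different result than adding the underlying arrays by position.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])  # same labels, reversed order

# The indexes differ, so pandas has to align before adding.
already_aligned = a.index.equals(b.index)

# pandas matches rows by index label: label 0 pairs 1.0 with 30.0, and so on.
by_label = (a + b).sort_index()

# The raw NumPy arrays are added purely by position.
by_position = a.to_numpy() + b.to_numpy()

print(already_aligned)        # False
print(by_label.tolist())      # [31.0, 22.0, 13.0]
print(by_position.tolist())   # [11.0, 22.0, 33.0]
```

When working on plain arrays (as in the numba version), keeping the rows in matching order is the caller's responsibility.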