piRSquared - 6 months ago
Python Question

How do I access a numpy array as quickly as a pandas dataframe?

I ran a comparison of several ways to access data in a DataFrame; see the results below. The quickest access came from the get_value method on a DataFrame. I was referred to it in this post.

What surprised me is that access via get_value is quicker than access via the underlying numpy object, df.values.

Question



My question is: is there a way to access elements of a numpy array as quickly as I can access a pandas dataframe via get_value?

Setup



import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4))


Testing



%%timeit
df.iloc[2, 2]



10000 loops, best of 3: 108 µs per loop


%%timeit
df.values[2, 2]



The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop


%%timeit
df.iat[2, 2]



The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.85 µs per loop


%%timeit
df.iat[1, 2]



The slowest run took 5.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.6 µs per loop


%%timeit
df.get_value(2, 2)



The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.57 µs per loop
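
Outside IPython, the same comparison can be sketched with the standard-library timeit module. This is only a sketch: the helper name time_access and the loop counts are made up for illustration, and get_value is timed only if the installed pandas still provides it (it was deprecated and later removed in newer pandas releases). Absolute numbers will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Each entry maps a label to a zero-argument callable doing one scalar lookup.
accessors = {
    "iloc": lambda: df.iloc[2, 2],
    "values": lambda: df.values[2, 2],
    "iat": lambda: df.iat[2, 2],
}
# get_value exists only in older pandas releases, so add it conditionally.
if hasattr(df, "get_value"):
    accessors["get_value"] = lambda: df.get_value(2, 2)


def time_access(func, number=2000, repeat=3):
    """Return the best per-call time in microseconds (hypothetical helper)."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e6


# Time every accessor and report the best of the repeated runs.
results = {name: time_access(func) for name, func in accessors.items()}
for name, micros in results.items():
    print(f"{name:>10}: {micros:.2f} µs per call")
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %%timeit prints.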

Answer

iloc is pretty general, accepting slices and lists as well as simple integers. In the case above, where you have simple integer indexing, pandas first checks that the index is a valid integer and then converts the request to an iat lookup, so it is bound to be much slower. iat eventually resolves to a call to get_value, so a direct call to get_value is naturally faster. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.

df.values does return an ndarray, but only after checking that it is a single contiguous block. This requires a few lookups and tests so it is a little slower than retrieving the value from the cache.
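
Since the overhead described here sits in the df.values lookup rather than in indexing the ndarray itself, one way to approach the question is to fetch the array once and index it directly afterwards. A minimal sketch, assuming the frame holds a single dtype and is not modified while arr is in use; the function names are illustrative and no timings are claimed for them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Pay the cost of the .values accessor once, outside any hot loop.
arr = df.values


def via_accessor():
    # Goes through the DataFrame attribute lookup on every call.
    return df.values[2, 2]


def via_hoisted():
    # Plain ndarray indexing; returns a numpy scalar.
    return arr[2, 2]


def via_item():
    # ndarray.item returns a plain Python scalar instead of a numpy scalar.
    return arr.item(2, 2)


print(via_accessor(), via_hoisted(), via_item())
```

Whether the hoisted lookup beats get_value on a given machine is something to measure rather than assume.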

We can defeat the caching by creating a new data frame every time. This shows that the values accessor is fastest, at least for data of a uniform type:

In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop

In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop

In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop

In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop

In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop

In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop

The code claims that ix is the most general, so in theory it should be slower than iloc. It may be that this particular test favours ix while other tests favour iloc, simply because of the order of the checks needed to identify the index as a scalar.
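
To read the uncached timings net of the construction cost, the 186 µs for building the frame can be subtracted from each combined figure. The sketch below does only that arithmetic on the numbers reported above; the variable names are made up, and because these are single measurements with different loop counts, the differences are rough estimates at best.

```python
# Timings reported above, in microseconds per loop.
construct_only = 186
construct_and_access = {
    "values": 200,
    "get_value": 309,
    "iat": 308,
    "iloc": 420,
    "ix": 316,
}

# Approximate cost of the access alone: combined time minus construction time.
net_cost = {
    name: total - construct_only
    for name, total in construct_and_access.items()
}

# Print from cheapest to most expensive.
for name, cost in sorted(net_cost.items(), key=lambda item: item[1]):
    print(f"{name:>10}: ~{cost} µs")
```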
