luca luca - 2 months ago 11x
Python Question

Is searchsorted faster than get_loc to find label location in a DataFrame Index?

I need to find the integer location of a label in a Pandas index. I know I can use the get_loc method, but then I discovered searchsorted. Just wondering if I should use the latter for a speed improvement, as I need to search for thousands of labels.


It depends on your use case. Using @ayhan's example:

With get_loc there is a big upfront cost of creating the hash table on the first lookup.

In [22]: idx = pd.Index(['R{0:07d}'.format(i) for i in range(10**7)])
In [23]: to_search = np.random.choice(idx, 10**5, replace=False)
In [24]: %time idx.get_loc(to_search[0])
Wall time: 1.57 s

But subsequent lookups may be faster (not guaranteed; it depends on the data):

In [9]: %%time
   ...: for i in to_search:
   ...:     idx.get_loc(i)
Wall time: 200 ms

In [10]: %%time
    ...: for i in to_search:
    ...:     np.searchsorted(idx, i)
Wall time: 486 ms
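
Since the question involves thousands of labels, it may also be worth avoiding the Python loop altogether. The following is a minimal sketch, not a benchmark: it assumes a unique, sorted index, uses a much smaller index than the timings above so it runs quickly, and relies on Index.get_indexer and Index.searchsorted both accepting an array of labels.

```python
import numpy as np
import pandas as pd

# Smaller index than in the timings above so the sketch runs quickly.
idx = pd.Index(['R{0:07d}'.format(i) for i in range(10**4)])
rng = np.random.default_rng(0)
to_search = rng.choice(idx.to_numpy(), 100, replace=False)

# One vectorized call per approach instead of a Python loop.
locs_hash = idx.get_indexer(to_search)     # hash-based; returns -1 for missing labels
locs_sorted = idx.searchsorted(to_search)  # binary search; assumes idx is sorted

# On a sorted, unique index where every label is present, both agree.
agree = bool((locs_hash == locs_sorted).all())
```

Which of the two is faster will again depend on the data, so it is worth timing both on your own index.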

Also, as Jeff noted, get_loc is guaranteed to always work, whereas searchsorted requires a monotonic index (and doesn't check for it).
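
To make that caveat concrete, here is a small sketch of the two ways searchsorted can mislead: an unsorted index and a label that is not present. The helper checked_searchsorted is hypothetical (it is not part of pandas); it only illustrates the checks you would have to add yourself.

```python
import pandas as pd

def checked_searchsorted(idx, label):
    """Hypothetical helper: binary search plus the checks searchsorted skips."""
    if not idx.is_monotonic_increasing:
        raise ValueError("index is not sorted; searchsorted would be unreliable")
    pos = int(idx.searchsorted(label))
    # searchsorted returns an insertion point, so confirm the label is really there.
    if pos == len(idx) or idx[pos] != label:
        raise KeyError(label)
    return pos

unsorted_idx = pd.Index(['b', 'a', 'c'])
sorted_idx = pd.Index(['a', 'c', 'e'])

loc_hash = unsorted_idx.get_loc('a')                 # correct position: 1
loc_bisect = int(unsorted_idx.searchsorted('a'))     # no error, but not the right position
insertion_point = int(sorted_idx.searchsorted('b'))  # 1, although 'b' is absent
```

By contrast, sorted_idx.get_loc('b') raises a KeyError, which is usually what you want for a missing label. Note that the extra checks in the helper add their own cost, so they eat into whatever speed advantage searchsorted might have.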