Michael Hooreman Michael Hooreman - 10 months ago 169
Python Question

Numpy/pandas optimization: bins counting

I would like to "bin" a

by counting the number
of values between
. The result is stored in a

import numpy as np
import pandas as pd
bins = pd.DataFrame({'from': np.arange(0, 1, 0.01), 'to': np.arange(0, 1, 0.01) + 0.1})
x = np.random.rand(1000000)
bins['N'] = bins.apply(lambda r: ((x >= r['from']) & (x < r['to'])).sum(), axis=1)

When I profile the code, the slowest part of the whole script (which contains many more things) if the last line, especially the lambda: 15 % of the time is spent in that lambda!

I have the feeling the instead of using a
, I must implemented in a vectorized way, but I can't figure out how.

I'm using Python 3.5, numpy 1.11 and pandas 0.18.1

EDIT: Additional information + test with intervaltree

In fact, that binning is used iteratively: the binning is initiated with some data, and might be updated with other data sets.

As suggested, I've tried with intervaltree, but it even get worst in terms of performances. From the second iteration, I received a user arning because the '+' is not supported by numexpr on boold dtype, so it switches to pure python mode.


If "...boundaries have fixed width, as [[min + 0 width, min + 1 width], [min + 1 width, min + 2 width], ..., [max - 1 width, max]]...", use numpy.histogram:

bins["N"] = numpy.histogram(x, numpy.concatenate([bins["from"], bins["to"].tail(1)]))[0]

It would be simpler than this, but if you have the last edge in bins["to"], you need to include it in the list of bins edges.

For details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html