Michael Hooreman Michael Hooreman - 5 months ago 71
Python Question

Numpy/pandas optimization: bins counting

I would like to "bin" a

numpy.array
or
pandas.Series
x
by counting the number
N
of values between
begin
and
end
. The result is stored in a
pandas.DataFrame
:

import numpy as np
import pandas as pd
bins = pd.DataFrame({'from': np.arange(0, 1, 0.01), 'to': np.arange(0, 1, 0.01) + 0.1})
x = np.random.rand(1000000)
bins['N'] = bins.apply(lambda r: ((x >= r['from']) & (x < r['to'])).sum(), axis=1)


When I profile the code, the slowest part of the whole script (which contains many more things) if the last line, especially the lambda: 15 % of the time is spent in that lambda!

I have the feeling the instead of using a
lambda
, I must implemented in a vectorized way, but I can't figure out how.

I'm using Python 3.5, numpy 1.11 and pandas 0.18.1




EDIT: Additional information + test with intervaltree

In fact, that binning is used iteratively: the binning is initiated with some data, and might be updated with other data sets.

As suggested, I've tried with intervaltree, but it even get worst in terms of performances. From the second iteration, I received a user arning because the '+' is not supported by numexpr on boold dtype, so it switches to pure python mode.

Answer

If "...boundaries have fixed width, as [[min + 0 width, min + 1 width], [min + 1 width, min + 2 width], ..., [max - 1 width, max]]...", use numpy.histogram:

bins["N"] = numpy.histogram(x, numpy.concatenate([bins["from"], bins["to"].tail(1)]))[0]

It would be simpler than this, but if you have the last edge in bins["to"], you need to include it in the list of bins edges.

For details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html