Michael Hooreman - 5 months ago 71

Python Question

I would like to "bin" a

`numpy.array`

`pandas.Series`

`x`

`N`

`begin`

`end`

`pandas.DataFrame`

`import numpy as np`

import pandas as pd

bins = pd.DataFrame({'from': np.arange(0, 1, 0.01), 'to': np.arange(0, 1, 0.01) + 0.1})

x = np.random.rand(1000000)

bins['N'] = bins.apply(lambda r: ((x >= r['from']) & (x < r['to'])).sum(), axis=1)

When I profile the code, the slowest part of the whole script (which contains many more things) if the last line, especially the lambda: 15 % of the time is spent in that lambda!

I have the feeling the instead of using a

`lambda`

I'm using Python 3.5, numpy 1.11 and pandas 0.18.1

EDIT: Additional information + test with intervaltree

In fact, that binning is used iteratively: the binning is initiated with some data, and might be updated with other data sets.

As suggested, I've tried with intervaltree, but it even get worst in terms of performances. From the second iteration, I received a user arning because the '+' is not supported by numexpr on boold dtype, so it switches to pure python mode.

Answer

If *"...boundaries have fixed width, as [[min + 0 width, min + 1 width], [min + 1 width, min + 2 width], ..., [max - 1 width, max]]..."*, use `numpy.histogram`

:

```
bins["N"] = numpy.histogram(x, numpy.concatenate([bins["from"], bins["to"].tail(1)]))[0]
```

It would be simpler than this, but if you have the last edge in bins["to"], you need to include it in the list of bins edges.

For details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html