astabada - 8 months ago 21

Python Question

I have a set of data and want to make a histogram of it. I need the bins to have the same *size*, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) case of *equally spaced* bins.

This will naturally come at the expense of the bin widths, which can - and in general will - be different.

I will specify the number of desired bins and the data set, obtaining the bin edges in return.

Example:

```
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]
```

So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
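
As a quick sanity check of the example, the counts and widths can be recomputed with `np.histogram` and `np.diff`. This is only a sketch that hard-codes the edges from the example; it does not implement `somefunc`.

```python
import numpy as np

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = np.array([1., 1.3, 2.1, 2.12])  # the edges from the example

# np.histogram treats every bin as half-open [left, right), except the
# last one, which also includes its right edge.
counts, _ = np.histogram(data, bins=bins_edges)
widths = np.diff(bins_edges)

print(counts)  # [2 2 2]
print(widths)  # approximately [0.3 0.8 0.02]
```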

There are two limitations:

- if several data points are identical, the bin containing them may hold more points than the others.

- if there are N data points and M bins are requested, each bin holds N/M points (integer division), plus one extra bin for the leftover points if N%M is not 0.
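
Both limitations can be seen on toy inputs. The sketch below uses made-up data (`tied`) and takes every second sorted value as an edge, which is an assumption for illustration rather than the code from the question.

```python
import numpy as np

# Limitation 1: repeated values. Taking every 2nd sorted value as an edge
# yields a duplicated edge, so one bin is empty and another holds 4 points.
tied = np.array([1., 1., 1., 1., 2., 3.])
edges = np.append(np.sort(tied)[::2], tied.max())
counts, _ = np.histogram(tied, bins=edges)
print(edges)   # [1. 1. 2. 3.]
print(counts)  # [0 4 2]

# Limitation 2: N = 7 points requested in M = 3 bins.
per_bin, leftover = divmod(7, 3)
print(per_bin, leftover)  # 2 1 -> three bins of 2 points, one extra bin of 1
```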

This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?

```
import numpy as np


def def_equbin(in_distr, binsize=None, bin_num=None):
    # binsize is unused; bin_num (the number of bins) must be given.
    distr_size = len(in_distr)

    # Points per full bin, and how many points are left over.
    bin_size = distr_size // bin_num
    odd_bin_size = distr_size % bin_num

    args = in_distr.argsort()

    # One row per full bin, holding that bin's sorted values.
    hist = np.zeros((bin_num, bin_size))

    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        # The leftover points form one extra, smaller bin.
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
```
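
If only the edges are needed, one possible direction for large inputs is to skip the full `argsort` and the `hist` array. `np.partition` puts selected order statistics at their sorted positions without sorting everything. The following is an unbenchmarked sketch; the name `equal_count_edges` and its parameters are hypothetical, and it is meant to return the same `bins_limits` as the function above.

```python
import numpy as np


def equal_count_edges(data, nbins):
    """Hypothetical helper: edges of bins holding len(data) // nbins points."""
    data = np.asarray(data)
    n = data.size
    bin_size = n // nbins

    # Sorted-order positions of the first element of each full bin, plus the
    # start of the leftover bin when n is not a multiple of nbins.
    starts = np.arange(nbins) * bin_size
    if n % nbins:
        starts = np.append(starts, nbins * bin_size)
    kth = np.append(starts, n - 1)  # the last edge is the maximum

    # np.partition guarantees that each requested position holds the value
    # it would hold in a fully sorted array.
    part = np.partition(data, np.unique(kth))
    return part[kth]


data = np.array([2.1, 1.0, 2.12, 1.3, 1.2, 2.0])
print(equal_count_edges(data, nbins=3))  # [1.   1.3  2.1  2.12]
```

Note that `np.partition` returns a copy; for very large arrays the in-place `ndarray.partition` method may be preferable if the original order does not need to be preserved.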

Answer

Using your example case (bins of 2 points, 6 total data points):

```
import numpy as np
from scipy import stats

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
# bin_edges: [1., 1.24666667, 2.05333333, 2.12]
```
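
The interior edges here are interpolated between data points, so they differ from the `[1., 1.3, 2.1, 2.12]` in the question, but each bin still holds 2 points. To generalize the probabilities `[0, 2./6, 4./6, 1]` to any number of bins, something like the sketch below may work. It swaps in `np.quantile` for `mquantiles` (a different default interpolation, so the interior edges differ slightly again), and the helper name `quantile_edges` is made up for illustration.

```python
import numpy as np


def quantile_edges(data, nbins):
    """Hypothetical helper: nbins + 1 edges at equally spaced quantiles."""
    # Probabilities 0, 1/nbins, ..., 1, generalizing [0, 2./6, 4./6, 1].
    probs = np.linspace(0.0, 1.0, nbins + 1)
    return np.quantile(data, probs)


data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = quantile_edges(data, nbins=3)

# Count the points falling into each bin defined by the computed edges.
counts, _ = np.histogram(data, bins=edges)
print(counts)  # [2 2 2]
```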