user6622314 - 4 months ago 28
Python Question

Binning Pandas column values by standard deviation centered on average?

I have a Pandas data frame with a bunch of values in sorted order:

``````df = pd.DataFrame(np.arange(1,21))
``````

I want to end up with a list/array like this:

``````[0,1.62,4.58,7.54,10.5,13.45,16.4,19.37,20]
``````

The first and last element are
`df.min()`
and
`df.max()`
, the center element is the
`df.mean()`
of the dataframe, and the surrounding elements are all in increments in of
`0.5*df.std()`

Is there a way to vectorize this for large DataFrames?

UPDATE (Efficient method is in the answers below!)

``````a = np.arange(df[0].mean(),df[0].min(),-0.5*df[0].std())
b = np.arange(df[0].mean(),df[0].max(),0.5*df[0].std())
c = np.concatenate((a,b))
c = np.append(c,[df[0].min(),df[0].max()])
c = np.unique(c)
``````

And then use
`np.digitize()`
to move values to appropriate bins.

If you find a more efficient way though, that would be helpful!

`mu_sig` calculates various multiples of standard deviations by multiplying `[-2, -1, 0, 1, 2]` by sigma.

`edges` takes a series and gets `mu_sig` results. Then checks to see that the series minimum is less then minimum multiple of standard deviation less the mean. If it is, then prepend it to list. Do the same check for max.

``````def edges(s, n=7, rnd=2, sig_mult=1):
mu = s.mean()
sig = s.std()
mn = s.min()
mx = s.max()

sig = np.arange(-n // 2, (n + 1) // 2 + 1) * sig * sig_mult
ms = (mu + sig)

# Checking if mins and maxs are in range of sigs
if mn < ms.min():
ms = np.concatenate([[mn], ms])
if mx > max(ms):
ms = np.concatenate([ms, [mx]])

return ms.round(rnd).tolist()
``````

It works on a series, so I'll squeeze your dataframe

``````df = pd.DataFrame(np.arange(1,21))
s = df.squeeze()
``````

Then use `edges`

``````edges(s, sig_mult=.5, n=5)

[1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]
``````

``````edges(s)

[1, -13.16, -7.25, -1.33, 4.58, 10.5, 16.42, 22.33, 28.25, 34.16, 20]
``````

This returns a list of length 11 by default. You can pass `n` to get different length lists.

``````edges(s, n=3)

[1, -1.33, 4.58, 10.5, 16.42, 22.33, 20]
``````

Anticipating that you may want to change this to different multiples of standard deviation, you can also do:

``````edges(df, n=3, sig_mult=.2)

[1, 8.13, 9.32, 10.5, 11.68, 12.87, 20]
``````

Timing

Series of length 20

Series of length 1,000,000