user6622314 user6622314 - 4 months ago 28
Python Question

Binning Pandas column values by standard deviation centered on average?

I have a Pandas data frame with a bunch of values in sorted order:

df = pd.DataFrame(np.arange(1,21))


I want to end up with a list/array like this:

[0,1.62,4.58,7.54,10.5,13.45,16.4,19.37,20]


The first and last element are
df.min()
and
df.max()
, the center element is the
df.mean()
of the dataframe, and the surrounding elements are all in increments in of
0.5*df.std()


Is there a way to vectorize this for large DataFrames?

UPDATE (Efficient method is in the answers below!)

a = np.arange(df[0].mean(),df[0].min(),-0.5*df[0].std())
b = np.arange(df[0].mean(),df[0].max(),0.5*df[0].std())
c = np.concatenate((a,b))
c = np.append(c,[df[0].min(),df[0].max()])
c = np.unique(c)


And then use
np.digitize()
to move values to appropriate bins.

If you find a more efficient way though, that would be helpful!

Answer

mu_sig calculates various multiples of standard deviations by multiplying [-2, -1, 0, 1, 2] by sigma.

edges takes a series and gets mu_sig results. Then checks to see that the series minimum is less then minimum multiple of standard deviation less the mean. If it is, then prepend it to list. Do the same check for max.

def edges(s, n=7, rnd=2, sig_mult=1):
    mu = s.mean()
    sig = s.std()
    mn = s.min()
    mx = s.max()

    sig = np.arange(-n // 2, (n + 1) // 2 + 1) * sig * sig_mult
    ms = (mu + sig)

    # Checking if mins and maxs are in range of sigs
    if mn < ms.min():
        ms = np.concatenate([[mn], ms])
    if mx > max(ms):
        ms = np.concatenate([ms, [mx]])

    return ms.round(rnd).tolist()

It works on a series, so I'll squeeze your dataframe

df = pd.DataFrame(np.arange(1,21))
s = df.squeeze()

Then use edges

THIS IS YOUR ANSWER

edges(s, sig_mult=.5, n=5)

[1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]

edges(s)

[1, -13.16, -7.25, -1.33, 4.58, 10.5, 16.42, 22.33, 28.25, 34.16, 20]

This returns a list of length 11 by default. You can pass n to get different length lists.

edges(s, n=3)

[1, -1.33, 4.58, 10.5, 16.42, 22.33, 20]

Anticipating that you may want to change this to different multiples of standard deviation, you can also do:

edges(df, n=3, sig_mult=.2)

[1, 8.13, 9.32, 10.5, 11.68, 12.87, 20]

Timing

Series of length 20

enter image description here

Series of length 1,000,000

enter image description here

Comments