user6622314 user6622314 - 1 year ago 123
Python Question

Binning Pandas column values by standard deviation centered on average?

I have a Pandas data frame with a bunch of values in sorted order:

df = pd.DataFrame(np.arange(1,21))

I want to end up with a list/array like this:


The first and last element are
, the center element is the
of the dataframe, and the surrounding elements are all in increments in of

Is there a way to vectorize this for large DataFrames?

UPDATE (Efficient method is in the answers below!)

a = np.arange(df[0].mean(),df[0].min(),-0.5*df[0].std())
b = np.arange(df[0].mean(),df[0].max(),0.5*df[0].std())
c = np.concatenate((a,b))
c = np.append(c,[df[0].min(),df[0].max()])
c = np.unique(c)

And then use
to move values to appropriate bins.

If you find a more efficient way though, that would be helpful!

Answer Source

mu_sig calculates various multiples of standard deviations by multiplying [-2, -1, 0, 1, 2] by sigma.

edges takes a series and gets mu_sig results. Then checks to see that the series minimum is less then minimum multiple of standard deviation less the mean. If it is, then prepend it to list. Do the same check for max.

def edges(s, n=7, rnd=2, sig_mult=1):
    mu = s.mean()
    sig = s.std()
    mn = s.min()
    mx = s.max()

    sig = np.arange(-n // 2, (n + 1) // 2 + 1) * sig * sig_mult
    ms = (mu + sig)

    # Checking if mins and maxs are in range of sigs
    if mn < ms.min():
        ms = np.concatenate([[mn], ms])
    if mx > max(ms):
        ms = np.concatenate([ms, [mx]])

    return ms.round(rnd).tolist()

It works on a series, so I'll squeeze your dataframe

df = pd.DataFrame(np.arange(1,21))
s = df.squeeze()

Then use edges


edges(s, sig_mult=.5, n=5)

[1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]


[1, -13.16, -7.25, -1.33, 4.58, 10.5, 16.42, 22.33, 28.25, 34.16, 20]

This returns a list of length 11 by default. You can pass n to get different length lists.

edges(s, n=3)

[1, -1.33, 4.58, 10.5, 16.42, 22.33, 20]

Anticipating that you may want to change this to different multiples of standard deviation, you can also do:

edges(df, n=3, sig_mult=.2)

[1, 8.13, 9.32, 10.5, 11.68, 12.87, 20]


Series of length 20

enter image description here

Series of length 1,000,000

enter image description here

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download