user6622314 - 1 year ago 100

Python Question

I have a Pandas data frame with a bunch of values in sorted order:

`df = pd.DataFrame(np.arange(1,21))`

I want to end up with a list/array like this:

`[0,1.62,4.58,7.54,10.5,13.45,16.4,19.37,20]`

The first and last element are

`df.min()`

`df.max()`

`df.mean()`

`0.5*df.std()`

Is there a way to vectorize this for large DataFrames?

UPDATE (Efficient method is in the answers below!)

`a = np.arange(df[0].mean(),df[0].min(),-0.5*df[0].std())`

b = np.arange(df[0].mean(),df[0].max(),0.5*df[0].std())

c = np.concatenate((a,b))

c = np.append(c,[df[0].min(),df[0].max()])

c = np.unique(c)

And then use

`np.digitize()`

If you find a more efficient way though, that would be helpful!

Answer Source

`mu_sig`

calculates various multiples of standard deviations by multiplying `[-2, -1, 0, 1, 2]`

by sigma.

`edges`

takes a series and gets `mu_sig`

results. Then checks to see that the series minimum is less then minimum multiple of standard deviation less the mean. If it is, then prepend it to list. Do the same check for max.

```
def edges(s, n=7, rnd=2, sig_mult=1):
mu = s.mean()
sig = s.std()
mn = s.min()
mx = s.max()
sig = np.arange(-n // 2, (n + 1) // 2 + 1) * sig * sig_mult
ms = (mu + sig)
# Checking if mins and maxs are in range of sigs
if mn < ms.min():
ms = np.concatenate([[mn], ms])
if mx > max(ms):
ms = np.concatenate([ms, [mx]])
return ms.round(rnd).tolist()
```

It works on a series, so I'll squeeze your dataframe

```
df = pd.DataFrame(np.arange(1,21))
s = df.squeeze()
```

Then use `edges`

```
edges(s, sig_mult=.5, n=5)
[1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]
```

```
edges(s)
[1, -13.16, -7.25, -1.33, 4.58, 10.5, 16.42, 22.33, 28.25, 34.16, 20]
```

This returns a list of length 11 by default. You can pass `n`

to get different length lists.

```
edges(s, n=3)
[1, -1.33, 4.58, 10.5, 16.42, 22.33, 20]
```

Anticipating that you may want to change this to different multiples of standard deviation, you can also do:

```
edges(df, n=3, sig_mult=.2)
[1, 8.13, 9.32, 10.5, 11.68, 12.87, 20]
```

**Series of length 20**

**Series of length 1,000,000**