ihsansat - 3 months ago 32

Python Question

i have dataset :

`recency;frequency;monetary`

21;156;41879955

13;88;16850284

8;74;79150488

2;74;26733719

9;55;16162365

...;...;...

detail raw data -> http://pastebin.com/beiEeS80

and i put into

`DataFrame`

`df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])`

df['recency'] = df['recency'].astype(float)

df['frequency'] = df['frequency'].astype(float)

df['monetary'] = df['monetary'].astype(float)

df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1

df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1

df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1

but it's return error

`df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1`

ValueError: Bin edges must be unique: array([ 1., 1., 2., 4., 9., 156.])

how solved this ?

thx

Answer

I ran this in Jupyter and placed the exampledata.txt to the same directory as the notebook.

Please note that the first line:

```
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
```

loads the colums `'userid'`

when it isn't defined in the data file. I removed this column name.

```
import pandas as pd
def pct_rank_qcut(series, n):
edges = pd.Series([float(i) / n for i in range(n + 1)])
f = lambda x: (edges >= x).argmax()
return series.rank(pct=1).apply(f)
datas = pd.read_csv('./exampledata.txt', delimiter=';')
df = pd.DataFrame(datas, columns=['recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)
df['recency'] = pct_rank_qcut(df.recency, 5)
df['frequency'] = pct_rank_qcut(df.frequency, 5)
df['monetary'] = pct_rank_qcut(df.monetary, 5)
```

The problem you were seeing was a result of pd.qcut assuming 5 bins of equal size. In the data you provided, `'frequency'`

has more than 28% number 1's. This broke `qcut`

.

I provided a new function `pct_rank_qcut`

that addresses this and pushes all 1's into the first bin.

```
edges = pd.Series([float(i) / n for i in range(n + 1)])
```

This line defines a series of percentile edges based on the desired number of bins defined by `n`

. In the case of `n = 5`

the edges will be `[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]`

```
f = lambda x: (edges >= x).argmax()
```

this line defines a helper function to be applied to another series in the next line. `edges >= x`

will return a series equal in length to `edges`

where each element is `True`

or `False`

depending on whether `x`

is less than or equal to that edge. In the case of `x = 0.14`

the resulting `(edges >= x)`

will be `[False, True, True, True, True, True]`

. By the taking the `argmax()`

I've identified the first index where the series is `True`

, in this case `1`

.

```
return series.rank(pct=1).apply(f)
```

This line takes the input `series`

and turns it into a percentile ranking. I can compare these rankings to the edges I've created and that's why I use the `apply(f)`

. What's returned should be a series of bin numbers numbered 1 to n. This series of bin numbers is the same thing you were trying to get with:

```
pd.qcut(df['recency'].values, 5).codes + 1
```

This has consequences in that the bins are no longer equal and that bin 1 borrows completely from bin 2. But some choice had to be made. If you don't like this choice, use the concept to build your own ranking.

```
print df.head()
recency frequency monetary
0 3 5 5
1 2 5 5
2 2 5 5
3 1 5 5
4 2 5 5
```