Gilberto - 6 months ago 34

Python Question

I want a function, e.g. `get_cluster(df, numspan)`, that takes a DataFrame `df` and a number of subsets `numspan` and returns `df_cluster`.

In other words:

- take the df, e.g. (not necessarily ordered, may be real numbers)
`1, 2, 3, 4, 5`

- get the max and min
`5`

`1`

- calculate the difference, which represents the main set width
`5 - 1 = 4`

- divide the difference by numspan, e.g. `2`, to get the subset unit width
`2`

- then for every item of the DataFrame check which subset it belongs to (the rule is *L1 <= x < L2*, where *L1* and *L2* are the lower and upper subset limits)
- return a number which represents the related subset, so the final df_cluster is (the last label, corresponding to the max upper limit, is included by rule)
`1, 1, 2, 2, 2`

My code (with another example):

    import pandas as pd

    df = pd.DataFrame({'A': pd.Series([4, 8, 2, 3])})

    def get_cluster(df, numspan):
        min = df.min()  # e.g. 2
        max = df.max()  # e.g. 8
        span = max - min  # e.g. 6
        subset_unit = span/numspan  # e.g. 6/3 = 2 -> every subset is 2 width
        # code I need...
        return df_cluster

    df['Cluster'] = get_cluster(df, 3)

    df

       A  Cluster
    0  4        2
    1  8        3  <= included by rule
    2  2        1
    3  3        1

Thank you very much for your help and your time,

Gilberto

Answer

This is what `pd.cut` does. Its `bins=` argument lets you set the number of bins, which is what you call *numspan* in the question.

It returns bin ranges by default. `labels=False` is a parameter you can use to get a bin number instead.
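A minimal sketch of how this could look for the example in the question. Two things here are assumptions that go beyond what is stated above: `right=False` is added so that the bins follow the *L1 <= x < L2* rule (by default `pd.cut` closes bins on the right), and `+ 1` is added because `labels=False` numbers bins from 0 while the question counts from 1. The function also takes a single column rather than the whole DataFrame, since `pd.cut` expects 1-D input; the parameter name `values` is only illustrative.

```python
import pandas as pd


def get_cluster(values, numspan):
    """Return 1-based equal-width bin numbers for a 1-D Series.

    Assumption: bins are left-closed (L1 <= x < L2), as in the question.
    With right=False, pd.cut slightly widens the last bin so that the
    maximum value still falls into it.
    """
    # labels=False returns integer bin numbers starting at 0
    codes = pd.cut(values, bins=numspan, right=False, labels=False)
    # shift to 1-based numbering to match the question's expected output
    return codes + 1


df = pd.DataFrame({'A': [4, 8, 2, 3]})
df['Cluster'] = get_cluster(df['A'], 3)
print(df)
```

For the data in the question this should give the clusters `2, 3, 1, 1`, and for `1, 2, 3, 4, 5` with two bins it should give `1, 1, 2, 2, 2`. If right-closed bins are acceptable, dropping `right=False` falls back to the `pd.cut` default.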