Gilberto - 1 month ago 13
Python Question

# How to classify/label pandas dataframe between minimum and maximum

I want a function, e.g. `get_cluster(df, numspan)`, that, given a pandas DataFrame `df` and an integer `numspan` as inputs, returns a DataFrame `df_cluster` of labels (numbers). Each label indicates which subset a value belongs to, where the subsets are obtained by dividing the difference between the max and min of the DataFrame by numspan.

In other words:

1. Take the df, e.g. `1, 2, 3, 4, 5` (not necessarily ordered; the values may be real numbers).

2. Get the max `5` and the min `1`.

3. Calculate the difference `5 - 1 = 4`, which represents the width of the main set.

4. Divide the difference by numspan, e.g. `2`, to get the subset unit width `2`.

5. For every item of the DataFrame, check which subset it belongs to (the rule is L1 <= x < L2, where L1 and L2 are the lower and upper limits of the subset).

6. Return a number that represents the related subset, so the final df_cluster is `1, 1, 2, 2, 2` (the last value, which equals the upper limit of the last subset, is included in that subset by rule).
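The six steps above can be sketched by hand as follows. This is only an illustration of the rule, not necessarily the best way to do it: `label_by_span` is a hypothetical helper name, and the sketch assumes the values are not all equal (otherwise the subset width is zero) and that floating-point rounding at subset boundaries is not a concern.

```python
import numpy as np
import pandas as pd

def label_by_span(values, numspan):
    """Hypothetical helper: 1-based subset labels following steps 1-6."""
    s = pd.Series(values, dtype=float)
    lo, hi = s.min(), s.max()        # step 2: min and max
    width = (hi - lo) / numspan      # steps 3-4: subset unit width
    # Step 5: floor((x - lo) / width) is the 0-based subset index
    # under the rule L1 <= x < L2.
    idx = np.floor((s - lo) / width).astype(int)
    # The max itself would land in an extra subset numbered numspan;
    # clip it back into the last subset ("included by rule").
    idx = idx.clip(upper=numspan - 1)
    return idx + 1                   # step 6: labels start at 1

print(label_by_span([1, 2, 3, 4, 5], 2).tolist())  # [1, 1, 2, 2, 2]
print(label_by_span([4, 8, 2, 3], 3).tolist())     # [2, 3, 1, 1]
```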

My code (with another example):

```python
import pandas as pd

df = pd.DataFrame({'A': pd.Series([4, 8, 2, 3])})

def get_cluster(df, numspan):
    min = df.min()  # e.g. 2
    max = df.max()  # e.g. 8
    span = max - min  # e.g. 6
    subset_unit = span / numspan  # e.g. 6/3 = 2 -> every subset is 2 wide

    # code I need...

    return df_cluster

df['Cluster'] = get_cluster(df, 3)
df
#    A  Cluster
# 0  4        2
# 1  8        3  <= included by rule
# 2  2        1
# 3  3        1
```


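This is what `pd.cut` does: its `bins=` argument sets the number of equal-width bins, i.e. what you call numspan in the question.
It returns bin ranges (intervals) by default; pass `labels=False` to get integer bin numbers instead. Those numbers start at 0, and the bins are closed on the right by default, so to match your L1 <= x < L2 rule with labels starting at 1, also pass `right=False` and add 1 to the result.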
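A minimal sketch of `get_cluster` built on `pd.cut`, using the example from the question. The `column` parameter is an assumption added here so the function knows which column to bin. `right=False` is used because the question asks for left-closed subsets (L1 <= x < L2), and 1 is added because `labels=False` numbers the bins from 0.

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 8, 2, 3]})

def get_cluster(df, numspan, column='A'):
    # bins=numspan      -> numspan equal-width bins between min and max
    # labels=False      -> integer bin numbers (0-based) instead of intervals
    # right=False       -> bins are [L1, L2); pandas widens the last edge
    #                      slightly so the max still falls in the last bin
    codes = pd.cut(df[column], bins=numspan, labels=False, right=False)
    return codes + 1  # shift to 1-based labels as in the question

df['Cluster'] = get_cluster(df, 3)
print(df['Cluster'].tolist())  # [2, 3, 1, 1]
```

With the default `right=True` the value 4 would fall in the first bin instead of the second, so the choice of `right` matters whenever a value sits exactly on a bin edge.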