Gilberto Gilberto - 1 month ago 13
Python Question

How to classify/label pandas dataframe between minimum and maximum

I want a function, e.g. get_cluster(df, numspan), that, given a pandas DataFrame df and an integer numspan as inputs, returns a DataFrame df_cluster of labels (numbers) that represent membership in the subset calculated according to the difference between the max and min of the DataFrame divided by numspan.

In other words:


  1. take the df, e.g. 1, 2, 3, 4, 5 (not necessarily ordered, may be real numbers)

  2. get the max 5 and min 1

  3. calculate the difference 5 - 1 = 4, which represents the main set width

  4. divide the difference by numspan, e.g. 2, to get the subset unit width 2

  5. then for every item of the DataFrame check which subset it belongs to (the rule is L1 <= x < L2, where L1 and L2 are the lower and upper subset limits)

  6. return a number which represents the related subset, so the final df_cluster is 1, 1, 2, 2, 2 (the last value, which equals the max upper limit, is included in the last subset by rule)



My code (with another example, see the picture below also):

import pandas as pd
df = pd.DataFrame({'A':pd.Series([4, 8, 2, 3])})

def get_cluster(df, numspan):
    min = df.min() # e.g. 2
    max = df.max() # e.g. 8
    span = max - min # e.g. 6
    subset_unit = span/numspan # e.g. 6/3 = 2 -> every subset is 2 width

    # code I need...

    return df_cluster

df['Cluster'] = get_cluster(df, 3)
df
   A  Cluster
0  4        2
1  8        3   <= included by rule
2  2        1
3  3        1


In picture:

Picture of the example

Thank you very much for your help and your time,

Gilberto

Answer

This is what pd.cut does: its bins= argument lets you set the number of equal-width bins, which is numspan in the question.
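As a minimal sketch of the bins= argument, using the example data from the question (the variable name binned is only for illustration): passing an integer asks pd.cut for that many equal-width bins between the minimum and maximum, and the result is a categorical Series of intervals.

```python
import pandas as pd

df = pd.DataFrame({'A': pd.Series([4, 8, 2, 3])})

# bins=3 plays the role of numspan: three equal-width bins spanning min..max
binned = pd.cut(df['A'], bins=3)

# Each row now holds the interval (bin range) its value fell into
print(binned)

# The result is categorical, with one category per bin
print(len(binned.cat.categories))  # 3
```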

pd.cut returns bin ranges (intervals) by default. Pass labels=False to get an integer bin number instead; these numbers start at 0, so add 1 if you want labels starting at 1 as in the question.
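Putting this together, here is one possible sketch of get_cluster, not the only way to write it. It assumes the function receives a single column (a Series) rather than the whole DataFrame, since pd.cut works on one-dimensional input. Because pd.cut closes bins on the right by default, right=False is passed here to match the L1 <= x < L2 rule from the question.

```python
import pandas as pd

def get_cluster(values, numspan):
    """Label each value with the 1-based number of its equal-width bin."""
    # right=False makes each bin closed on the left: L1 <= x < L2.
    # pandas widens the last edge slightly, so the maximum stays in the last bin.
    codes = pd.cut(values, bins=numspan, right=False, labels=False)
    # labels=False yields 0-based bin numbers; shift them to start at 1
    return codes + 1

df = pd.DataFrame({'A': pd.Series([4, 8, 2, 3])})
df['Cluster'] = get_cluster(df['A'], 3)
print(df['Cluster'].tolist())  # [2, 3, 1, 1]
```

With the default right=True, the value 4 would fall into the first bin instead of the second, so the choice of right matters whenever a value sits exactly on a bin edge.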
