Gilberto Gilberto - 1 year ago 110
Python Question

How to classify/label pandas dataframe between minimum and maximum

I want a function, e.g. get_cluster(df, numspan), that, given a pandas DataFrame and an integer as inputs, returns a DataFrame of labels (numbers) that represent membership in the subset calculated according to the difference between max and min of the DataFrame divided by numspan.

In other words:

  1. take the df, e.g. 1, 2, 3, 4, 5 (not necessarily ordered, may be real numbers)

  2. get the max and min

  3. calculate the difference, 5 - 1 = 4, which represents the main set width

  4. divide the difference by numspan, e.g. 4 / 2 = 2, to get the subset unit width

  5. then for every item of the DataFrame check which subset it belongs to (the rule is L1 <= x < L2, where L1 and L2 are the lower and upper subset limits)

  6. return a number which represents the related subset, so the final df_cluster is 1, 1, 2, 2, 2 (the last label, corresponding to the max upper limit, is included by rule)
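The six steps above can be sketched by hand as follows. This is only an illustration under stated assumptions: label_by_span is a hypothetical helper name (not from the question), the input is treated as a single column of numbers, and max is assumed to be strictly greater than min so the width is non-zero.

```python
import numpy as np
import pandas as pd

def label_by_span(values, numspan):
    """Hypothetical helper: label each value 1..numspan by equal-width subset."""
    s = pd.Series(values, dtype="float64")
    lo, hi = s.min(), s.max()          # step 2
    width = (hi - lo) / numspan        # steps 3 and 4 (assumes hi > lo)
    # step 5: floor((x - lo) / width) is the 0-based subset index for L1 <= x < L2
    idx = np.floor((s - lo) / width)
    # the maximum would fall one subset too far, so clip it into the last subset
    idx = idx.clip(upper=numspan - 1)
    # step 6: shift to 1-based labels
    return idx.astype(int) + 1

print(label_by_span([1, 2, 3, 4, 5], 2).tolist())  # [1, 1, 2, 2, 2]
```

Values that sit exactly on a subset boundary depend on floating-point division, so real-valued data may need a tolerance.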

My code (with another example, see the picture below also):

import pandas as pd
df = pd.DataFrame({'A':pd.Series([4, 8, 2, 3])})

def get_cluster(df, numspan):
    min = df.min() # e.g. 2
    max = df.max() # e.g. 8
    span = max - min # e.g. 6
    subset_unit = span/numspan # e.g. 6/3 = 2 -> every subset is 2 width

    # code I need...

    return df_cluster

df['Cluster'] = get_cluster(df, 3)
   A  Cluster
0  4        2
1  8        3  <= included by rule
2  2        1
3  3        1

In picture:

Picture of the example

Thank you very much for your help and your time.


Answer Source

This is what pd.cut does: its bins= argument lets you set the number of equal-width bins, which is what you call numspan in the question.

It returns bin ranges by default. labels=False is a parameter you can use to get a bin number instead.
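A minimal sketch of how this could be wired into the get_cluster function from the question. Two details here are assumptions beyond what the answer states: right=False is passed so that each bin follows the question's L1 <= x < L2 rule (the default, right=True, closes bins on the right instead), and 1 is added because labels=False yields 0-based bin numbers. The column parameter is a hypothetical addition for selecting which column to bin.

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 8, 2, 3]})

def get_cluster(df, numspan, column='A'):
    # column is a hypothetical parameter, not part of the question's signature.
    # labels=False returns 0-based integer bin numbers instead of intervals.
    # right=False makes each bin [L1, L2); with an integer bins= value pandas
    # widens the last edge slightly, so the maximum lands in the last bin.
    codes = pd.cut(df[column], bins=numspan, labels=False, right=False)
    return codes + 1  # shift to the 1-based labels used in the question

df['Cluster'] = get_cluster(df, 3)
print(df['Cluster'].tolist())  # [2, 3, 1, 1]
```

With the default right=True, the value 4 would fall in the first bin rather than the second, which does not match the expected output in the question.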
