ShanZhengYang - 4 months ago
Python Question

Pandas dataframe: how to cluster together groups by values without machine learning?

I have the following pandas DataFrame.

import pandas as pd
df = pd.read_csv('filename.csv')

print(df)

     A   B   C         D
0    2   0  11  0.053095
1    2   0  11  0.059815
2    0  35  11  0.055268
3    0  35  11  0.054573
4    0   1  11  0.054081
5    0   2  11  0.054426
6    0   1  11  0.054426
7    0   1  11  0.054426
8   42   7   3  0.048208
9   42   7   3  0.050765
10  42   7   3  0.053250

....


The problem is that the data is naturally "clustered" into groups, but the group labels are not given. In the sample above, rows 0-1 form one group, rows 2-3 another, rows 4-7 a third, and rows 8-10 a fourth.

I need to infer these group labels. One could use machine learning, but is it possible to do this using only pandas?

Can one use groupby on the column values to create these groups? The problem is that the values are not exact. For the third group (rows 4-7), column B has the values 1, 2, 1, 1.

Answer

A pure pandas solution would involve binning. This assumes that the values within a cluster are close to each other, and that the bin size is large enough to cover the variation within a cluster but smaller than the distance between clusters. Whether such a bin size exists depends on your data.

The binning approach uses the cut function in pandas. You pass it a Series (or array) and the number of bins you want. The function divides the range of the Series into that many equal-width bins and determines which bin each value falls in. With labels=False, the output is the integer index of the bin for each value, which is something you can group by, following your original train of thought.
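To make the behaviour of cut concrete, here is a minimal sketch that bins only the column B values from the question's sample. The choice of 7 bins is an assumption made for illustration, so that each bin is about 5 units wide over the range 0 to 35.

```python
import pandas as pd

# Column B values copied from the sample rows in the question
b = pd.Series([0, 0, 35, 35, 1, 2, 1, 1, 7, 7, 7])

# Split the range [0, 35] into 7 equal-width bins (each about 5 wide).
# labels=False returns the integer index of the bin instead of an interval label.
b_bins = pd.cut(b, 7, labels=False)

print(b_bins.tolist())
```

Here the values 0, 1 and 2 all land in bin 0, the value 7 lands in bin 1, and 35 lands in bin 6, so the small variation inside the third group (1, 2, 1, 1) no longer matters.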

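In practice, for bins of size ~5, this looks like:

import numpy as np

for col in list(df.columns):  # copy the column list, since the loop adds new columns
   binned_name = col + '_binned'
   # cut splits the range (max - min) evenly, and needs an integer bin count of at least 1
   num_bins = max(int(np.ceil((df[col].max() - df[col].min()) / 5)), 1)
   df[binned_name] = pd.cut(df[col], num_bins, labels=False)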
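Once the binned columns exist, the grouping step from the question can be sketched as follows. This is only an illustration under stated assumptions: the DataFrame is rebuilt from the sample rows, only columns A, B and C are binned, and the names bin_width, binned_cols and group are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (column D left out for brevity)
df = pd.DataFrame({
    'A': [2, 2, 0, 0, 0, 0, 0, 0, 42, 42, 42],
    'B': [0, 0, 35, 35, 1, 2, 1, 1, 7, 7, 7],
    'C': [11, 11, 11, 11, 11, 11, 11, 11, 3, 3, 3],
})

bin_width = 5  # assumed bin size; tune it to your data
binned_cols = []
for col in ['A', 'B', 'C']:
    span = df[col].max() - df[col].min()
    # cut needs an integer bin count of at least 1
    num_bins = max(int(np.ceil(span / bin_width)), 1)
    df[col + '_binned'] = pd.cut(df[col], num_bins, labels=False)
    binned_cols.append(col + '_binned')

# Rows sharing the same combination of bin indices get the same group id
df['group'] = df.groupby(binned_cols).ngroup()

print(df[['A', 'B', 'C', 'group']])
```

On this sample, rows 2-3 and rows 8-10 each get their own group id, but rows 0-1 and rows 4-7 end up sharing one, because their A and B values differ by no more than 2 and so fall into the same bins of width 5. That is the caveat above in action: the bin size has to be smaller than the distance between clusters, which depends on your data.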