dustin dustin - 2 months ago 8
Python Question

Python: creating a group column in a Pandas dataframe based on an integer range of values

For each range

[0, 150]
in the
diff
column, I want to create a group column that increases by 1 each time the range resets. When
diff
is negative, the range resets.

import pandas as pd
df = pd.DataFrame({'year': [2016, 2016, 2016, 2016, 2016, 2016, 2016],
'month' : [1, 1, 2, 3, 3, 3, 3],
'day': [23, 25, 1, 1, 7, 20, 30]})
df = pd.to_datetime(df)
df = pd.concat([df, pd.Series(data=[15, 35, 80, 5, 20, 45, 90])], axis=1)
df.columns = ['date', 'percentworn']
col_shift = ['percentworn']
df_shift = df.shift(1).loc[:, col_shift]
df_combined = df.join(df_shift, how='inner', rsuffix='_2')
df_combined.fillna(value=0,inplace=True)
df_combined['diff'] = df_combined['percentworn'] - df_combined['percentworn_2']


enter image description here

The
grp
columns should have
0, 0, 0, 1, 1, 1, 1
. The code I tried was

def grping(df):
df_ = df.copy(deep=True)
i = 0
if df_['diff'] >= 0:
df_['grp'] = i
else:
i += 1
df_['grp'] = i
return df_
df_combined.apply(grping,axis=1)


I need to
i += 1
persist after incrementing. How can I achieve this? Or is there a better way get the desired results?

enter image description here

Answer

IIUC you can test whether the 'diff' column is negative which produces a boolean array, then cast this to int and the call cumsum:

In [313]:
df_combined['group'] = (df_combined['diff'] < 0).astype(int).cumsum()
df_combined

Out[313]:
        date  percentworn  percentworn_2  diff  group
0 2016-01-23           15            0.0  15.0      0
1 2016-01-25           35           15.0  20.0      0
2 2016-02-01           80           35.0  45.0      0
3 2016-03-01            5           80.0 -75.0      1
4 2016-03-07           20            5.0  15.0      1
5 2016-03-20           45           20.0  25.0      1
6 2016-03-30           90           45.0  45.0      1

breaking the above down:

In [314]:
df_combined['diff'] < 0

Out[314]:
0    False
1    False
2    False
3     True
4    False
5    False
6    False
Name: diff, dtype: bool

In [316]:
(df_combined['diff'] < 0).astype(int)

Out[316]:
0    0
1    0
2    0
3    1
4    0
5    0
6    0
Name: diff, dtype: int32
Comments