Vivek Kalyanarangan Vivek Kalyanarangan - 1 year ago 139
Python Question

Group by consecutive count Pandas Python

Here's the data:

ID Type
1 In
1 In
1 Out
1 In
2 Out
2 In
2 In
2 In
2 Out

Question: I want a pandas query that can fetch me the IDs where "In" appears more than 2 times consecutively. So, if I run the query on the above data it should only fetch me 2.

Answer Source

Try this:

df.groupby('ID')['Type'].agg(lambda x: (x=='In').rolling(3).apply(lambda x: x.all()).max())
1    0.0
2    1.0
Name: Type, dtype: float64

It will return 1 for the groups that meet your criterion, 0 otherwise.

It first groups by ID and takes the Type column. For your example, it has two groups: {1: ['In', 'In', 'Out', 'In'], 2: ['Out', 'In', 'In', 'In', 'Out']}. For each group (x), it first creates a boolean series x=='In'. The series are [True, True, False, True] and [False, True, True, True, False]. Now, on those series, it applies the rolling function. It takes three at a time and evaluates x.all(). For the first group, the first three ([True, True, False]) and the second three ([True, False, True]) returns False because all three should be True. The maximum of these two False's is 0. For the second group, the rolling method will produce ([False, True, True], [True, True, True], [True, True, False]) so for the second one x.all() will be True and therefore the maximum will be 1.

Series.rolling() was introduced in pandas 0.18 I believe. For earlier versions, you can use:

df.groupby('ID')['Type'].agg(lambda x: pd.rolling_apply(x=='In', 3, lambda x: x.all()).max())
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download