dsl1990 dsl1990 - 1 month ago 5
Python Question

Multiple slicing rows in pandas on a cell condition

Msgtype Date ConvID message
enquire 12/1 689 I want your car
reply 12/3 689 it is available
reply 12/4 689 rent please?
reply 12/6 689 $200
accept 12/8 689 please pay through CC
reply 12/8 689 thank you, what about fuel?
reply 12/8 689 you have to take care
enquire 12/3 690 Looking for car
reply 12/4 690 available
accept 12/5 690 paid
reply 12/6 690 thank you


I want to group this data by ConvID and sort it by date. I want the rows till "Msgtype" = accept for that particular ConvID. Aim to analyze message data till booking request is accepted for particular ConvID. so for ConvID = 689, I want rows till "Msgtype" = accept. Rest of the rows after "accept" are not required.

Eg: These two are not required for ConvID = 689

Msgtype Date ConvID message
reply 12/8 689 thank you, what about fuel?
reply 12/8 689 you have to take care


Similarly this row is not required for ConvID = 690

Msgtype Date ConvID message
reply 12/6 690 thank you

Answer

I think you can use:

mask1 = (df.Msgtype == 'accept')
mask = mask1.groupby([df.ConvID]).apply(lambda x: x.shift().fillna(False).cumsum()) == 0

print (df[mask].sort_values(['ConvID','Date']))
   Msgtype  Date  ConvID                message
0  enquire  12/1     689        I want your car
1    reply  12/3     689        it is available
2    reply  12/4     689           rent please?
3    reply  12/6     689                   $200
4   accept  12/8     689  please pay through CC
7  enquire  12/3     690        Looking for car
8    reply  12/4     690              available
9   accept  12/5     690                   paid

Explanations:

#mask where is 'accept'
mask1 = (df.Msgtype == 'accept')
print (mask1)
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9      True
10    False
Name: Msgtype, dtype: bool

#per group shift, replace NaN by False and cumulative sum
print (mask1.groupby([df.ConvID]).apply(lambda x: x.shift().fillna(False).cumsum()))
0     0
1     0
2     0
3     0
4     0
5     1
6     1
7     0
8     0
9     0
10    1
Name: Msgtype, dtype: int32
#where output of groupby is 0 
mask = mask1.groupby([df.ConvID]).apply(lambda x: x.shift().fillna(False).cumsum()) == 0
print (mask)
0      True
1      True
2      True
3      True
4      True
5     False
6     False
7      True
8      True
9      True
10    False
Name: Msgtype, dtype: bool

#boolean indexing and sorting
print (df[mask].sort_values(['ConvID','Date']))
   Msgtype  Date  ConvID                message
0  enquire  12/1     689        I want your car
1    reply  12/3     689        it is available
2    reply  12/4     689           rent please?
3    reply  12/6     689                   $200
4   accept  12/8     689  please pay through CC
7  enquire  12/3     690        Looking for car
8    reply  12/4     690              available
9   accept  12/5     690                   paid
Comments