adele - 10 months ago
Python Question

Statistics based on number of matches in dataframe column

I'm looking for a Pythonic approach to capturing stats based on the number of matches in a DataFrame column. So, working with this example:

import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
test_B_mask = rng['initial_data'] == 'B'
rng['test_for_B'] = rng['initial_data'][test_B_mask]

and running this function to provide matches:

def func_match(df_in, val):
    # 1 where this row and the previous row both equal val, else 0
    return ((df_in == val) & (df_in.shift() == val)).astype(int)

I get the following output:

2014-04-02 09:00:00 0
2014-04-02 10:00:00 0
2014-04-02 11:00:00 0
2014-04-02 12:00:00 0
2014-04-02 13:00:00 0
2014-04-02 14:00:00 1
2014-04-02 15:00:00 0
2014-04-02 16:00:00 0
2014-04-03 09:00:00 0
2014-04-03 10:00:00 0
2014-04-03 11:00:00 0
2014-04-03 12:00:00 1
2014-04-03 13:00:00 1
2014-04-03 14:00:00 0
Freq: BH, Name: test_for_B, dtype: int64

I can use something simple like

func_match(rng['test_for_B'], 'B').sum()

which returns

3

to get the number of times the values match in total, but could someone help with a function to provide the following, more granular, statistics please?

  • Number and percentage of times a single match is seen.

  • Number and percentage of times two consecutive matches are seen, and so on up to n max matches (which is just the run of 3 matches, 2014-04-03 11:00:00 through 13:00:00, in this example).

I'm guessing this would be a dict used within the function, but I'm sure many of the experienced coders on Stack Overflow are used to conducting this kind of analysis, so I would love to learn how to approach this task.

Thank you in advance for any help with this.


I didn't initially specify the desired output, as I am open to all options and didn't want to deter anyone from providing solutions. However, as per MaxU's request for desired output, something like this would be great:

        Matches  Matches_Percent
0 match       3               30
1 match       4               40
2 match       2               20
3 match       1               10
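For reference, the run lengths behind a table like this can also be counted with plain Python's itertools.groupby (a sketch on the sample data; the numbers in the table above are illustrative, so only the mechanism is shown here):

```python
from itertools import groupby

import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A',
                                     'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))

# Length of each run of consecutive 'B' values.
run_lengths = [sum(1 for _ in grp)
               for key, grp in groupby(rng['initial_data']) if key == 'B']
print(run_lengths)  # [2, 3] for this data: one double and one triple
```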

Answer

Initial setup

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A',]},
                   index = pd.date_range('4/2/2014', periods=14, freq='BH'))

Assign a boolean column 'test_for_B'

rng['test_for_B'] = rng['initial_data'] == 'B'

Tricky bit

Test for 'B' where the previous row was not 'B'. This signifies the beginning of a run. The cumsum then ties each run's rows together under one group label.

contigious_groups = ((rng.initial_data == 'B') & (rng.initial_data != rng.initial_data.shift())).cumsum()
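To make the labelling concrete, this is what the expression yields on the sample data (the setup is repeated so the snippet runs standalone; label 0 covers the rows before the first 'B' run, and each new run of 'B' starts a new label that persists until the next run begins):

```python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A',
                                     'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))

# True only on the first 'B' of each run; cumsum turns that into group labels.
contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()
print(contigious_groups.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
```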

Now I group by this grouping we created and sum the bools within each group. This gets at whether each run is a double, a triple, etc.

counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()
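On the sample data this gives one sum per run of 'B' (a quick self-contained check of that step):

```python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A',
                                     'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'

contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()

# Drop the label-0 rows, then sum the bools inside each labelled run.
counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()
print(counts.tolist())  # [2, 3]: a double and a triple
```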

Then use value_counts to get the frequency of each run length, and divide by contigious_groups.max(), because that's the number of groups.

counts.value_counts() / contigious_groups.max()

3.0    0.5
2.0    0.5
Name: test_for_B, dtype: float64
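If the tabular layout from the question is wanted, the same counts can be reshaped into it (a sketch; on this sample only doubles and triples occur, so the split comes out 50/50 rather than the illustrative 30/40/20/10):

```python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A',
                                     'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'

contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()
counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()

# Frequency of each run length, plus its share as a percentage.
freq = counts.value_counts().sort_index()
summary = pd.DataFrame({'Matches': freq,
                        'Matches_Percent': 100 * freq / freq.sum()})
print(summary)
```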