adele - 1 year ago 76
Python Question

# Statistics based on number of matches in dataframe column

I'm looking for a Pythonic approach to capture statistics based on the number of matches in a DataFrame column. So, working with this example:

``````import pandas as pd

rng = pd.DataFrame(
    {'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                      'A', 'A', 'A', 'B', 'B', 'B', 'A']},
    index=pd.date_range('4/2/2014', periods=14, freq='BH'))
``````

and running this function to provide matches:

``````def func_match(df_in, val):
    # 1 where this row and the previous row both equal `val`, else 0
    return ((df_in == val) & (df_in.shift() == val)).astype(int)

rng['test_for_B'] = rng['initial_data'] == 'B'
func_match(rng['test_for_B'], True)
``````

I get the following output:

``````2014-04-02 09:00:00    0
2014-04-02 10:00:00    0
2014-04-02 11:00:00    0
2014-04-02 12:00:00    0
2014-04-02 13:00:00    0
2014-04-02 14:00:00    1
2014-04-02 15:00:00    0
2014-04-02 16:00:00    0
2014-04-03 09:00:00    0
2014-04-03 10:00:00    0
2014-04-03 11:00:00    0
2014-04-03 12:00:00    1
2014-04-03 13:00:00    1
2014-04-03 14:00:00    0
Freq: BH, Name: test_for_B, dtype: int64
``````

I can use something simple like
`func_match(rng['test_for_B'], True).sum()`

which returns

``````3
``````

to get the total number of times the values match, but could someone help with a function that provides the following, more granular statistics?

• Number and percentage of times a single match is seen.

• Number and percentage of times two consecutive matches are seen (up to the maximum run length, which here is just the three matches from 2014-04-03 11:00:00 through 13:00:00).

I'm guessing this would be a dict used within the function, but I'm sure many of the experienced coders on Stack Overflow are used to conducting this kind of analysis, so I would love to learn how to approach this task.

Thank you in advance for any help with this.

Edit:

I didn't initially specify the desired output, as I am open to all options and didn't want to deter anyone from providing solutions. However, as per MaxU's request for the desired output, something like this would be great:

``````         Matches  Matches_Percent
0 match        3               30
1 match        4               40
2 match        2               20
3 match        1               10
etc.
``````
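For context, one way to sketch a breakdown like this is to measure the run lengths directly with `itertools.groupby` (the `run_lengths` and `summary` names, and the construction itself, are my own illustration, not from the question):

``````python
import pandas as pd
from itertools import groupby

letters = ['A', 'A', 'A', 'A', 'B', 'B', 'A',
           'A', 'A', 'A', 'B', 'B', 'B', 'A']

# Length of every run of consecutive 'B's: here [2, 3].
run_lengths = [len(list(grp)) for key, grp in groupby(letters) if key == 'B']

# Frequency of each run length, with its share of all runs as a percentage.
dist = pd.Series(run_lengths).value_counts().sort_index()
summary = pd.DataFrame({'Matches': dist,
                        'Matches_Percent': 100 * dist / dist.sum()})
print(summary)
``````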

### Initial setup

``````import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                                     'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
``````

### Assign `bool` to column `'test_for_B'`

``````rng['test_for_B'] = rng['initial_data'] == 'B'
``````
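As a quick sanity check on the example data, the mask is `True` at exactly the five `'B'` rows:

``````python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                                     'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'

# Two 'B' rows in the first run, three in the second: five True values.
print(rng['test_for_B'].sum())
``````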

### Tricky bit

Test for `'B'` and that the previous row was not `'B'`. This marks the beginning of a group; `cumsum` then ties the members of each group together.

``````contigious_groups = ((rng.initial_data == 'B') & (rng.initial_data != rng.initial_data.shift())).cumsum()
``````
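To see what the cumulative sum produces on the example data, we can print the group labels (an illustrative check; `contigious_groups` is the answer's own variable):

``````python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                                     'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))

# The flag is True only at the two transitions into a 'B' run, so the
# cumulative sum steps 0 -> 1 at the first run and 1 -> 2 at the second.
contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()
print(contigious_groups.tolist())
``````

Note that rows *after* a run keep that run's label until the next run begins; the next step filters those trailing rows out.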

Now I `groupby` the grouping we just created and `sum` the `bool`s within each group. This tells us whether each group is a double, a triple, etc.

``````counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()
``````
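On the example data this yields one run of two and one run of three (an illustrative check, reusing the answer's variables):

``````python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                                     'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'
contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()

# The boolean mask drops group 0 (rows before the first 'B'); within each
# remaining group, summing the True values gives that run's length.
counts = (rng.loc[contigious_groups.astype(bool)]
             .groupby(contigious_groups)
             .test_for_B.sum())
print(counts.tolist())  # one run of 2 and one run of 3
``````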

Then use `value_counts` to get the frequency of each group size, and divide by `contigious_groups.max()`, which is the total number of groups.

``````counts.value_counts() / contigious_groups.max()

3.0    0.5
2.0    0.5
Name: test_for_B, dtype: float64
``````
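To get counts and percentages side by side, in the shape the asker sketched, the same `counts` series can be wrapped in a small frame (the `freq` and `result` names, and the column labels modeled on the asker's example, are my own):

``````python
import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                                     'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'
contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()
counts = (rng.loc[contigious_groups.astype(bool)]
             .groupby(contigious_groups)
             .test_for_B.sum())

# How often each run length occurs, and its share of all runs as a percentage.
freq = counts.value_counts().sort_index()
result = pd.DataFrame({'Matches': freq,
                       'Matches_Percent': 100 * freq / freq.sum()})
print(result)
``````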