Vince Vince - 15 days ago 6
Python Question

How to find precision @ N using Pandas in Python

I'd like to calculate the proportion of A grades of systems but only include questions with a certain number of grades. I start with the following data

system question grade
Sys1 Is? A
Sys1 Is? A
Sys1 Is? C
Sys1 How? B
Sys1 How? A
Sys1 How? F
Sys1 How? F
Sys2 Is? A
Sys2 Is? A
Sys2 Is? B
Sys2 How? A


Precision = The proportion of A grades

N = Question has at least N grades in the system

I want to calculate the precision @ n for each system. For example, Precision @ 2 would be the proportion of 'A' grades for each system for questions with at least 2 grades

I'm struggling to bring calculations together as I'm just learning Pandas. Below is an example of what I'm trying to achieve using N=2:


  1. Group by system, question and filter out groups with size < 2





size = df.groupby('system', 'question').size()
size[size >= 2]
system question
Sys1 Is? 3
How? 4
Sys2 Is? 3




  1. Get size (count) of each grade per system


    df.groupby(['system', 'question', 'grade']).size()
    system question grade
    Sys1 Is? A 2
    C 1
    How? A 1
    B 1
    F 2
    Sys2 Is? A 2
    B 1
    How? A 1

  2. Join the groups, only keeping the questions that weren't filtered out in step 1

    ???

  3. Sum up all the A's per system and divide by the total number of grades in the system

    ???



Desired result:

system precision
Sys1 0.43
Sys2 0.66


Sys1 has a precision of 0.42 since it had 3 A's out of 7 grades. "Sys2 How?" is not included because it had less than 2 grades (N=1) so Sys2 has 2 A's out of the 3 grades for "Is?"

Your help is greatly appreciated.

Leo Leo
Answer

Try something like this:

df = pd.DataFrame({'system':['Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys2', 'Sys2', 'Sys2', 'Sys2'],
               'question': ['Is?', 'Is?', 'Is?', 'How?', 'How?', 'How?', 'How?', 'Is?', 'Is?', 'Is?', 'How?'],
               'grade': ['A', 'A', 'C', 'B', 'A', 'F', 'F', 'A', 'A', 'B', 'A']})

q_size_df = pd.DataFrame(df.groupby(['system', 'question']).size(), columns=['q_size']).reset_index()

df2 = df.merge(q_size_df)

df2[df2['q_size']>=2].groupby(['system']).apply(lambda x: len(x[x['grade']=='A']) / len(x))

Output is:

system
Sys1    0.428571
Sys2    0.666667
dtype: float64

The idea behind it is to first calculate the size metric in question, then merge that metric back to the original df, and finally calculate your statistic on a filtered df. The advantage of doing it this way is that more filtering columns (i.e. criteria) can be added in the future without having to implement additional logic outside of the filtering column itself.

Comments