Vince - 1 year ago 78
Python Question

# How to find precision @ N using Pandas in Python

I'd like to calculate the proportion of A grades of systems but only include questions with a certain number of grades. I start with the following data

``````system     question    grade
Sys1       Is?         A
Sys1       Is?         A
Sys1       Is?         C
Sys1       How?        B
Sys1       How?        A
Sys1       How?        F
Sys1       How?        F
Sys2       Is?         A
Sys2       Is?         A
Sys2       Is?         B
Sys2       How?        A
``````

Precision = The proportion of A grades

N = Question has at least N grades in the system

I want to calculate the precision @ n for each system. For example, Precision @ 2 would be the proportion of 'A' grades for each system for questions with at least 2 grades

I'm struggling to bring calculations together as I'm just learning Pandas. Below is an example of what I'm trying to achieve using N=2:

1. Group by system, question and filter out groups with size < 2

```

size = df.groupby('system', 'question').size()
size[size >= 2]
system  question
Sys1    Is?    3
How?   4
Sys2    Is?    3

```

1. Get size (count) of each grade per system

```
Sys1      Is?         A     2
C     1
How?        A     1
B     1
F     2
Sys2     Is?          A     2
B     1
How?         A     1
```

2. Join the groups, only keeping the questions that weren't filtered out in step 1

`???`

3. Sum up all the A's per system and divide by the total number of grades in the system

`???`

Desired result:

``````system   precision
Sys1         0.43
Sys2         0.66
``````

Sys1 has a precision of 0.42 since it had 3 A's out of 7 grades. "Sys2 How?" is not included because it had less than 2 grades (N=1) so Sys2 has 2 A's out of the 3 grades for "Is?"

Try something like this:

``````df = pd.DataFrame({'system':['Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys2', 'Sys2', 'Sys2', 'Sys2'],
'question': ['Is?', 'Is?', 'Is?', 'How?', 'How?', 'How?', 'How?', 'Is?', 'Is?', 'Is?', 'How?'],
'grade': ['A', 'A', 'C', 'B', 'A', 'F', 'F', 'A', 'A', 'B', 'A']})

q_size_df = pd.DataFrame(df.groupby(['system', 'question']).size(), columns=['q_size']).reset_index()

df2 = df.merge(q_size_df)

``````system