Vince - 3 months ago 15

Python Question

I'd like to calculate the proportion of A grades of systems but only include questions with a certain number of grades. I start with the following data

`system question grade`

Sys1 Is? A

Sys1 Is? A

Sys1 Is? C

Sys1 How? B

Sys1 How? A

Sys1 How? F

Sys1 How? F

Sys2 Is? A

Sys2 Is? A

Sys2 Is? B

Sys2 How? A

Precision = The proportion of A grades

N = Question has at least N grades in the system

I want to calculate the precision @ n for each system. For example,

I'm struggling to bring calculations together as I'm just learning Pandas. Below is an example of what I'm trying to achieve using N=2:

- Group by system, question and filter out groups with size < 2

size = df.groupby('system', 'question').size()

size[size >= 2]

system question

Sys1 Is? 3

How? 4

Sys2 Is? 3

- Get size (count) of each grade per system

df.groupby(['system', 'question', 'grade']).size()

system question grade

Sys1 Is? A 2

C 1

How? A 1

B 1

F 2

Sys2 Is? A 2

B 1

How? A 1

- Join the groups, only keeping the questions that weren't filtered out in step 1

`???`

- Sum up all the A's per system and divide by the total number of grades in the system

`???`

Desired result:

`system precision`

Sys1 0.43

Sys2 0.66

Sys1 has a precision of 0.42 since it had 3 A's out of 7 grades. "Sys2 How?" is not included because it had less than 2 grades (N=1) so Sys2 has 2 A's out of the 3 grades for "Is?"

Your help is greatly appreciated.

Answer

Try something like this:

```
df = pd.DataFrame({'system':['Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys1', 'Sys2', 'Sys2', 'Sys2', 'Sys2'],
'question': ['Is?', 'Is?', 'Is?', 'How?', 'How?', 'How?', 'How?', 'Is?', 'Is?', 'Is?', 'How?'],
'grade': ['A', 'A', 'C', 'B', 'A', 'F', 'F', 'A', 'A', 'B', 'A']})
q_size_df = pd.DataFrame(df.groupby(['system', 'question']).size(), columns=['q_size']).reset_index()
df2 = df.merge(q_size_df)
df2[df2['q_size']>=2].groupby(['system']).apply(lambda x: len(x[x['grade']=='A']) / len(x))
```

Output is:

```
system
Sys1 0.428571
Sys2 0.666667
dtype: float64
```

The idea behind it is to first calculate the size metric in question, then merge that metric back to the original df, and finally calculate your statistic on a filtered df. The advantage of doing it this way is that more filtering columns (i.e. criteria) can be added in the future without having to implement additional logic outside of the filtering column itself.