user1043144 - 2 years ago 221
Python Question

# Frequency tables in pandas (like plyr in R)

My problem is how to calculate frequencies on multiple variables in pandas .
I have from this dataframe :

``````d1 = pd.DataFrame( {'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6",   "x7",     "x8", "x9"],
'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
``````

To the following result

``````             Participated  OfWhichpassed
ExamenYear
2007                   3              2
2008                   4              3
2009                   3              2
``````

(1) One possibility I tried is to compute two dataframe and bind them

``````t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
tx = pd.concat([t1, t2] , axis = 1)

Res1 = tx['yes']
``````

(2) The second possibility is to use an aggregation function .

``````import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})

Res2.columns = ['Participated', 'OfWhichpassed']
``````

Both ways are awckward to say the least.
How is this done properly in pandas ?

P.S: I also tried *value_counts* instead of collections.Counter but could not get it to work

For reference: Few months ago, I asked similar question for R here and plyr could help

---- UPDATE ------

user DSM is right. there was a mistake in the desired table result.

(1) The code for option one is

`````` t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t3 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)

Res1 = pd.DataFrame( {'All': t1,
'OfWhichParticipated': t2['yes'],
'OfWhichPassed': t3['yes']})
``````

It will produce the result

``````             All  OfWhichParticipated  OfWhichPassed
ExamenYear
2007          3                    2              2
2008          4                    3              3
2009          3                    3              2
``````

(2) For Option 2, thanks to user herrfz, I figured out how to use value_count and the code will be

``````Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
'Participated': lambda x: x.value_counts()['yes'],
'Passed': lambda x: x.value_counts()['yes']})

Res2.columns = ['All', 'OfWgichParticipated', 'OfWhichPassed']
``````

which will produce the same result as Res1

My question remains though:

Using Option 2, will it be possible to use the same Variable twice (for another operation ?) can one pass a custom name for the resulting variable ?

---- A NEW UPDATE ----

I have finally decided to use apply which I understand is more flexible.

``````d1.groupby('ExamenYear').agg({'Participated': len,