Saff Saff - 2 months ago 8
Python Question

Working with the output of groupby and groupby.size()

I have a pandas dataframe containing a row for each object manipulated by participants during a user study. Each participant takes part in the study 3 times, once in each of 3 conditions (a, b, c), working with around 300-700 objects in each condition.

When I report the number of objects worked with I want to make sure that this didn't vary significantly by condition (I don't expect it to have done, but need to confirm this statistically).

I think I want to run an ANOVA to compare the 3 conditions, but I can't figure out how to get the data I need for the ANOVA.

I currently have some pandas code to group the data and count the number of rows per participant per condition (so I can then use mean() and similar to summarise the data). An example with a subset of the data follows:

>>> tmp = df.groupby([FIELD_PARTICIPANT, FIELD_CONDITION]).size()
>>> tmp
participant_id  condition
1               a            576
2               b            367
3               a            703
4               c            309
dtype: int64


To calculate the ANOVA I would normally just filter these by the condition column, e.g.

cond1 = tmp[tmp[FIELD_CONDITION] == CONDITION_A]
cond2 = tmp[tmp[FIELD_CONDITION] == CONDITION_B]
cond3 = tmp[tmp[FIELD_CONDITION] == CONDITION_C]
f_val, p_val = scipy.stats.f_oneway(cond1, cond2, cond3)


However, since tmp is a Series rather than the DataFrame I'm used to, I can't figure out how to achieve this in the normal way.

>>> tmp[FIELD_CONDITION]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 583, in __getitem__
    result = self.index.get_value(self, key)
  File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 626, in get_value
    raise e1
KeyError: 'condition'
>>> type(tmp)
<class 'pandas.core.series.Series'>
>>> tmp.index
MultiIndex(levels=[[u'1', u'2', u'3', u'4'], [u'd', u's']],
labels=[[0, 1, 2, 3], [0, 0, 0, 1]],
names=[u'participant_id', u'condition'])


I feel sure this is a straightforward problem to solve, but I can't seem to get there without some help :)

Answer

I think you need reset_index, which converts the Series returned by size() into a DataFrame with the group keys as ordinary columns:

tmp = df.groupby([FIELD_PARTICIPANT, FIELD_CONDITION]).size().reset_index(name='count')

Sample:

import pandas as pd

df = pd.DataFrame({'participant_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 4, 9: 4},
                   'condition': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b', 6: 'b', 7: 'a', 8: 'c', 9: 'c'}})
print (df)
  condition  participant_id
0         a               1
1         a               1
2         a               1
3         a               1
4         b               2
5         b               2
6         b               2
7         a               3
8         c               4
9         c               4

tmp = df.groupby(['participant_id', 'condition']).size().reset_index(name='count')
print (tmp)
   participant_id condition  count
0               1         a      4
1               2         b      3
2               3         a      1
3               4         c      2
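
Once the counts are in an ordinary column, the per-condition filtering from the question can feed scipy.stats.f_oneway directly. The following is only a minimal sketch: the counts are invented for illustration (they are not the asker's data), and the column names assume the sample above.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-participant counts, shaped like the output of
# groupby(...).size().reset_index(name='count'); the numbers are made up.
tmp = pd.DataFrame({
    'participant_id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'condition': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'count': [576, 367, 703, 540, 401, 650, 309, 420, 611],
})

# Build one array of counts per condition.
groups = [g['count'].to_numpy() for _, g in tmp.groupby('condition')]

# One-way ANOVA across the three conditions.
f_val, p_val = stats.f_oneway(*groups)
print(f_val, p_val)
```

Filtering with tmp[tmp['condition'] == 'a']['count'] for each condition should give the same arrays; the groupby loop just avoids repeating the filter three times.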

If you need to keep working with the Series, you can build a boolean mask from the values of the condition level of the MultiIndex, which get_level_values returns:

tmp = df.groupby(['participant_id', 'condition']).size()
print (tmp)
participant_id  condition
1               a            4
2               b            3
3               a            1
4               c            2
dtype: int64

print (tmp.index.get_level_values('condition'))
Index(['a', 'b', 'a', 'c'], dtype='object', name='condition')

print (tmp.index.get_level_values('condition') == 'a')
[ True False  True False]

print (tmp[tmp.index.get_level_values('condition') == 'a'])
participant_id  condition
1               a            4
3               a            1
dtype: int64
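
As a possible alternative to writing one mask per condition, the Series can be split on the index level in a single pass. This is a sketch using the same sample data as above; the name by_condition is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'participant_id': [1, 1, 1, 1, 2, 2, 2, 3, 4, 4],
    'condition': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'c', 'c'],
})
tmp = df.groupby(['participant_id', 'condition']).size()

# Group the Series by the 'condition' level of its MultiIndex,
# giving one sub-Series of counts per condition.
by_condition = {name: grp for name, grp in tmp.groupby(level='condition')}
print(by_condition['a'].tolist())  # [4, 1]

# xs selects a single condition and drops that level from the index.
only_a = tmp.xs('a', level='condition')
print(only_a.tolist())  # [4, 1]
```

The values of by_condition can then be passed to scipy.stats.f_oneway in the same way as the filtered columns.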