Uri Laserson Uri Laserson - 1 year ago 151
Python Question

Using pandas, how do I subsample a large DataFrame by group in an efficient manner?

I am trying to subsample rows of a DataFrame according to a grouping. Here is an example. Say I define the following data:

from pandas import *
df = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
'value' : ["apple","pear","orange","apple",

If I group by
, then the number of rows in each group is here:

In [190]: df.groupby(['group1','group2'])['value'].agg({'count':len})
a 1 2
2 1
3 2
4 1
b 1 2
2 2
3 1
4 1
c 3 1
4 1
5 2
6 1

(If there is an even more concise way to compute that, please tell.)

I now want to construct a DataFrame that has one randomly selected row from each group. My proposal is to do it like so:

In [215]: from random import choice
In [216]: grouped = df.groupby(['group1','group2'])
In [217]: subsampled = grouped.apply(lambda x: df.reindex(index=[choice(range(len(x)))]))
In [218]: subsampled.index = range(len(subsampled))
In [219]: subsampled
group1 group2 value
0 b 2 pear
1 a 1 apple
2 b 2 pear
3 a 1 apple
4 a 1 apple
5 a 1 apple
6 a 1 apple
7 a 1 apple
8 a 1 apple
9 a 1 apple
10 a 1 apple
11 a 1 apple

which works. However, my real data has about 2.5 million rows and 12 columns. If I do this the dirty way by building my own data structures, I can complete this operation in a matter of seconds. However, my implementation above does not finish within 30 minutes (and does not appear to be memory-limited). As a side note, when I tried implementing this in R, I first tried
, which also did not finish in a reasonable amount of time; however, a solution using
finished very rapidly.

How do I get this to work rapidly with
? I want to love this package, so please help!



Answer Source

I tested with apply, it seems that when there are many sub groups, it's very slow. the groups attribute of grouped is a dict, you can choice index directly from it:

subsampled = df.ix[(choice(x) for x in grouped.groups.itervalues())]

EDIT: As of pandas version 0.18.1, itervalues no longer works on groupby objects - you can just use .values:

subsampled = df.ix[(choice(x) for x in grouped.groups.values())]
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download