moustachio - 8 months ago
Python Question

Removing extraneous index for groupby-apply functions that generate dataframes

This is a problem I've encountered in various contexts, and I'm curious if I'm doing something wrong, or if my whole approach is off. The particular data/functions are not important here, but I'll include a concrete example in any case.

It's not uncommon to want a groupby/apply that does various operations on each group, and returns a new dataframe. An example might be something like this:

def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique()) / float(len(df))) * df['dist'].mean()
    start = first['ts']
    return pd.DataFrame({'diversity': [diversity], 'start': [start]})

So, this is a grouping function that generates a new DataFrame with two columns, each derived from a different operation on the input data. Again, the specifics aren't too important here, but this is the issue:

When I look at the output, I get something like this:

result = df.groupby('patch_idx').apply(patch_stats)
print(result)

      diversity                start
0  0   0.876161  2007-02-24 22:54:28
1  0   0.588997  2007-02-25 01:55:39
2  0   0.655306  2007-02-25 04:27:05
3  0   0.986047  2007-02-25 05:37:58
4  0   0.997020  2007-02-25 06:27:08
5  0   0.639499  2007-02-25 17:40:56
6  0   0.687874  2007-02-26 05:24:11
7  0   0.003714  2007-02-26 07:07:20
8  0   0.065533  2007-02-26 09:01:11
9  0   0.000000  2007-02-26 19:23:52
10 0   0.068846  2007-02-26 20:43:03

It's all good, except I have an extraneous, unnamed index level that I don't want:

print(result.index.names)

FrozenList([u'patch_idx', None])

Now, this isn't a huge deal; I can always get rid of the extraneous index level with something like:

result = result.reset_index(level=1,drop=True)
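
For reference, a minimal reproduction of the extra level and the workaround, sketched on invented toy data (the values and the helper name patch_stats_frame are assumptions, not the original dataset):

```python
import pandas as pd

# Toy data standing in for the original dataset (all values are invented).
df = pd.DataFrame({
    'patch_idx': [0, 0, 0, 1, 1],
    'artist_id': [1, 1, 2, 3, 4],
    'dist': [1.0, 2.0, 3.0, 0.5, 1.5],
    'ts': pd.to_datetime([
        '2007-02-24 22:54:28', '2007-02-24 23:10:00', '2007-02-24 23:30:00',
        '2007-02-25 01:55:39', '2007-02-25 02:10:00',
    ]),
})

def patch_stats_frame(group):
    # Returns a one-row DataFrame; its own index (0) becomes the
    # second, unnamed index level after groupby/apply.
    diversity = group['artist_id'].nunique() / float(len(group)) * group['dist'].mean()
    return pd.DataFrame({'diversity': [diversity], 'start': [group['ts'].iloc[0]]})

cols = ['artist_id', 'dist', 'ts']
nested = df.groupby('patch_idx')[cols].apply(patch_stats_frame)
print(nested.index.names)  # two levels: 'patch_idx' and an unnamed one

# Drop the unnamed inner level, keeping only the group key.
flat = nested.reset_index(level=1, drop=True)
print(flat.index.names)    # one level: 'patch_idx'
```

The column selection before apply is only there so the function never sees the grouping column, which recent pandas versions warn about; it does not change the index structure.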

But since this comes up any time I have a grouping function that returns a DataFrame, I'm wondering whether there's a better way to do this. Is it bad form to have a grouping function that returns a DataFrame? If so, what's the right method to get the same kind of result? (Again, this is a general question about problems of this type.)


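In your grouping function, return a Series instead of a DataFrame. apply then builds one row per group, using the Series labels as column names and the group key as the only index level. Specifically, replace the last line of patch_stats with: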

return pd.Series({'diversity':diversity, 'start':start})
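
A self-contained sketch of the Series-returning version, run on invented toy data (the values are assumptions; only the column names come from the question):

```python
import pandas as pd

# Toy data standing in for the original dataset (all values are invented).
df = pd.DataFrame({
    'patch_idx': [0, 0, 0, 1, 1],
    'artist_id': [1, 1, 2, 3, 4],
    'dist': [1.0, 2.0, 3.0, 0.5, 1.5],
    'ts': pd.to_datetime([
        '2007-02-24 22:54:28', '2007-02-24 23:10:00', '2007-02-24 23:30:00',
        '2007-02-25 01:55:39', '2007-02-25 02:10:00',
    ]),
})

def patch_stats(group):
    first = group.iloc[0]
    diversity = (len(group['artist_id'].unique()) / float(len(group))) * group['dist'].mean()
    start = first['ts']
    # A Series with the same labels for every group: the labels become
    # columns, and no extra index level is created.
    return pd.Series({'diversity': diversity, 'start': start})

# Select the data columns so the function does not receive the grouping column.
result = df.groupby('patch_idx')[['artist_id', 'dist', 'ts']].apply(patch_stats)
print(result)
print(result.index.names)  # only 'patch_idx'
```

This relies on every group returning a Series with identical labels; if the labels differ between groups, the result will not take this one-row-per-group shape.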