captain ahab captain ahab - 7 months ago 548
Python Question

Faster way to rank rows in subgroups in pandas dataframe

I have a pandas data frame that has is composed of different subgroups.

df = pd.DataFrame({
'id':[1, 2, 3, 4, 5, 6, 7, 8],
'group':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'value':[.01, .4, .2, .3, .11, .21, .4, .01]
})


I want to find the rank of each id in its group with say, lower values being better. In the example above, in group A, Id 1 would have a rank of 1, Id 2 would have a rank of 4. In group B, Id 5 would have a rank of 2, Id 8 would have a rank of 1 and so on.

Right now I assess the ranks by:


  1. Sorting by value.

    df.sort('value', ascending = True, inplace=True)

  2. Create a ranker function (it assumes variables already sorted)

    def ranker(df):
    df['rank'] = np.arange(len(df)) + 1
    return df

  3. Apply the ranker function on each group separately:

    df = df.groupby(['group']).apply(ranker)



This process works but it is really slow when I run it on millions of rows of data. Does anyone have any ideas on how to make a faster ranker function.

Answer

rank is cythonized so should be very fast. And you can pass the same options as df.rank() here are the docs for rank. As you can see, tie-breaks can be done in one of five different ways via the method argument.

Its also possible you simply want the .cumcount() of the group.

In [12]: df.groupby('group')['value'].rank(ascending=False)
Out[12]: 
0    4
1    1
2    3
3    2
4    3
5    2
6    1
7    4
dtype: float64
Comments