motam79 motam79 - 1 month ago 9
Python Question

Pandas sorting within a group results in duplicated index

I have a pandas DataFrame in the following format:

C1 C2
A 0 1.764052 0.400157
1 0.978738 2.240893
2 1.867558 -0.977278
3 0.950088 -0.151357
4 -0.103219 0.410599
C 0 0.144044 1.454274
1 0.761038 0.121675
2 0.443863 0.333674
3 1.494079 -0.205158
4 0.313068 -0.854096


It is generated from this code:

import pandas as pd
import numpy as np
idx = pd.MultiIndex.from_product([['A','C'],range(5)])
np.random.seed(0)
df = pd.DataFrame(index=idx, data = np.random.randn(len(idx), 2), columns=['C1','C2'])


I would like to sort each group (of the first index level) separately by the column
C1
so I use the
groupby
function:

df.groupby(level=0).apply(lambda x: x.sort_values('C1'))

and I get the result:

C1 C2
A A 4 -0.103219 0.410599
3 0.950088 -0.151357
1 0.978738 2.240893
0 1.764052 0.400157
2 1.867558 -0.977278
C C 0 0.144044 1.454274
4 0.313068 -0.854096
2 0.443863 0.333674
1 0.761038 0.121675
3 1.494079 -0.205158


Why is the first level duplicated? Is there a better way to sort data within each group?

Answer

You need add parameter group_keys=False to groupby:

a = df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('C1'))
print (a)
           C1        C2
A 4 -0.103219  0.410599
  3  0.950088 -0.151357
  1  0.978738  2.240893
  0  1.764052  0.400157
  2  1.867558 -0.977278
C 0  0.144044  1.454274
  4  0.313068 -0.854096
  2  0.443863  0.333674
  1  0.761038  0.121675
  3  1.494079 -0.205158