kodachrome kodachrome - 3 years ago 154
Python Question

How do I rank only specific rows within a dataframe?

I have a dataframe that has the percent of people who engaged in a behavior, as well as the index to the general population. I'd like to rank these behaviors by index... but I want to exclude very low incidence behaviors from this ranking.

Easy to do if I just limited the entire dataframe to exclude these behaviors, but here's the catch--I still want to include them in my df; I just don't want to include them in the ranking.

My dataframe started out like this, with group as the index:

GROUP BEHAVIOR SUBBEHAVIOR PERCENT INDEX
GROUP1 behavior1 foo 0.521 123
GROUP1 behavior2 bar 0.013 213
GROUP1 behavior2 fuzz 1.034 103
GROUP1 behavior3 foobar 0.750 131
GROUP2 behavior1 foo 1.521 101
GROUP2 behavior2 bar 0.913 91
GROUP2 behavior2 fuzz 0.034 301
GROUP2 behavior3 foobar 0.950 87


But I don't want, say, that second row with 0.013 ranked. I want instead this:

GROUP BEHAVIOR SUBBEHAVIOR PERCENT INDEX RANK
GROUP1 behavior1 foo 0.521 123 2
GROUP1 behavior2 bar 0.013 213 NaN
GROUP1 behavior2 fuzz 1.034 103 3
GROUP1 behavior3 foobar 0.750 131 1
GROUP2 behavior1 foo 1.521 101 1
GROUP2 behavior2 bar 0.913 91 2
GROUP2 behavior2 fuzz 0.034 301 NaN
GROUP2 behavior3 foobar 0.950 87 3


I achieved the above by doing this:

filtered = df[fd.loc[:,'PERCENT']>0.05].copy()
filtered['RANK'] = filtered.groupby(level=0).rank(ascending=False)['INDEX']
final = df.merge(filtered, on=df.columns.tolist(), how='left')


(I have far more columns that make subbehavior truly unique, hence why I've passed all of df's columns.)

It works, but it feels like a very hamfisted approach, and I'm wondering if there is a better, more elegant way. Is there?

Answer Source

Two approaches

Using df.where:

Since NA groups are automatically excluded in groupby (see here in docs), and the default NA handling for rank is to just leave them as is, you can use df.where for your filter and proceed directly.

df['RANK'] = (df.where(df.PERCENT > .05)
                 .groupby('GROUP')
                 .rank(ascending=False)
                 .INDEX)

>>> df

         BEHAVIOR  INDEX  PERCENT SUBBEHAVIOR  RANK
GROUP                                              
GROUP1  behavior1    123    0.521         foo   2.0
GROUP1  behavior2    213    0.013         bar   NaN
GROUP1  behavior2    103    1.034        fuzz   3.0
GROUP1  behavior3    131    0.750      foobar   1.0
GROUP2  behavior1    101    1.521         foo   1.0
GROUP2  behavior2     91    0.913         bar   2.0
GROUP2  behavior2    301    0.034        fuzz   NaN
GROUP2  behavior3     87    0.950      foobar   3.0

Straightforward solution:

Alternatively, just reset the index on your DataFrame first so that you have an identifier to align the index with.

df.reset_index(inplace=True)
df['RANK'] = (df.query('PERCENT > .05')  # or using a mask ofc
                .groupby('GROUP')
                .rank(ascending=False)
                .INDEX)

The query can be replaced by a boolean indexing like df[df.PERCENT > .05] of course.

So then

>>> df.set_index('GROUP')

         BEHAVIOR SUBBEHAVIOR  PERCENT  INDEX  RANK
GROUP                                              
GROUP1  behavior1         foo    0.521    123   2.0
GROUP1  behavior2         bar    0.013    213   NaN
GROUP1  behavior2        fuzz    1.034    103   3.0
GROUP1  behavior3      foobar    0.750    131   1.0
GROUP2  behavior1         foo    1.521    101   1.0
GROUP2  behavior2         bar    0.913     91   2.0
GROUP2  behavior2        fuzz    0.034    301   NaN
GROUP2  behavior3      foobar    0.950     87   3.0
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download