CWlearner CWlearner - 6 months ago 73
Python Question

Find the column names which have top 3 largest values for each row

For example the data look like:

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
x=pd.DataFrame(df)
Out[260]:
a1 a2 a3 a4 a5 a6
0 5 23 4 1 4 5
1 6 43 2 2 98 43
2 3 56 3 1 23 3
3 2 2 6 3 5 2
4 5 6 7 2 7 5


I need the result to look like:

top1 top2 top3
a2 a1 a6
a5 a2 a6
....


I've seen answer to a previous questions (see below) that recommends idxmax. But how to handle top n values (n>1)?

Find the column name which has maximum value for each row [pandas]

Update:

I find the answer very useful but the only thing is that my data is long so have to figure out a way to bypass that. I ended up saving the data to a csv file and then reading it back in in chunks. here is the code I used:

data = pd.read_csv('xxx.csv', chunksize=1000)
rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for chunk in data:
x=pd.DataFrame(chunk).T
for i in x.columns:
df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
rslt = pd.concat([rslt, df1row], axis=0)
rslt=rslt.reset_index(drop=True)

Answer

What you need is pandas.DataFrame.nlargest.

import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}

x=pd.DataFrame(df).T

rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for i in x.columns:
    df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
    rslt = pd.concat([rslt, df1row], axis=0)

print rslt

Out[52]: 
  top1 top2 top3
0   a2   a1   a6
0   a5   a2   a6
0   a2   a5   a1
0   a3   a5   a4
0   a3   a5   a2