Antonio López Ruiz - 3 months ago
Python Question

Selecting the highest columns of a matrix in pandas (Python)

I have the following data:

https://github.com/antonio1695/Python/blob/master/nearBPO/facturasb.csv

It is a matrix like the following example:

UUID A B C D E F G H I
1.1 0 1 0 0 0 1 0 0 0
1.2 1 1 0 0 0 0 0 0 0
1.3 0 0 0 0 1 0 0 0 0
1.4 0 0 0 1 0 1 1 1 1
1.5 0 1 0 0 0 0 1 0 0
1.6 0 0 1 0 0 0 1 0 0
1.7 0 1 0 0 0 0 0 1 0
1.8 0 0 1 0 0 0 1 0 0
1.9 0 1 0 0 0 0 1 0 1


I would like to make a new matrix with only the 50 highest columns (3 in the example) and their respective UUIDs. By the highest columns I mean those columns that have the most 1's in the matrix.

If I'm not clear enough, don't hesitate to ask. Thank you.

Answer

If I understand correctly:

df[df.sum().nlargest(3).index]
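
To see what that one-liner does step by step, here is a minimal sketch on the example matrix from the question. It rebuilds the frame inline so it runs on its own; the names `counts`, `top` and `result` are mine, not part of the original answer.

```python
from io import StringIO
import pandas as pd

# Example matrix from the question, rebuilt here so the snippet runs on its own.
text = """UUID A B C D E F G H I
1.1 0 1 0 0 0 1 0 0 0
1.2 1 1 0 0 0 0 0 0 0
1.3 0 0 0 0 1 0 0 0 0
1.4 0 0 0 1 0 1 1 1 1
1.5 0 1 0 0 0 0 1 0 0
1.6 0 0 1 0 0 0 1 0 0
1.7 0 1 0 0 0 0 0 1 0
1.8 0 0 1 0 0 0 1 0 0
1.9 0 1 0 0 0 0 1 0 1"""
df = pd.read_csv(StringIO(text), index_col=0, sep=r"\s+")

counts = df.sum()          # column-wise sum = number of 1's in each column
top = counts.nlargest(3)   # the 3 largest counts; ties go to the earlier column
result = df[top.index]     # select those columns; UUID is kept as the index

print(counts.to_dict())
print(list(result.columns))
```

Because UUID is the index, it comes along with whatever columns are selected, so there is no need to select it separately.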


To exclude rows that are all zeros across the n largest columns:

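# labels of the 3 columns with the largest sums
n = df.sum().nlargest(3).index
df1 = df.loc[:, n]
# keep only rows with at least one 1 (axis must be a keyword in recent pandas)
df1[df1.eq(1).any(axis=1)]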
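
For the real data (50 columns instead of 3), the two steps above could be wrapped in a small helper. This is only a sketch: `top_n_columns` and its parameters are hypothetical names of mine, and it assumes UUID is already the index of the frame and that the cells are 0/1.

```python
import pandas as pd

def top_n_columns(df, n=50, drop_empty_rows=False):
    """Return the n columns with the largest sums (hypothetical helper)."""
    cols = df.sum().nlargest(n).index      # labels of the n largest column sums
    out = df.loc[:, cols]
    if drop_empty_rows:
        # keep only rows that have at least one 1 in the selected columns
        out = out[out.eq(1).any(axis=1)]
    return out

# Small made-up frame, just to exercise the helper.
toy = pd.DataFrame(
    {"A": [0, 1, 0], "B": [1, 1, 0], "C": [0, 0, 0], "D": [1, 0, 0]},
    index=pd.Index([1.1, 1.2, 1.3], name="UUID"),
)
small = top_n_columns(toy, n=2, drop_empty_rows=True)
print(small)

# Hypothetical usage on the real file, ASSUMING it has a column named UUID:
# big = pd.read_csv("facturasb.csv", index_col="UUID")
# top50 = top_n_columns(big, n=50)
```

If `n` is larger than the number of columns, `nlargest` simply returns all of them, so the helper does not need a special case for that.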


Setup

from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
import pandas as pd

text = """UUID  A   B   C   D   E   F   G   H   I  
1.1   0   1   0   0   0   1   0   0   0
1.2   1   1   0   0   0   0   0   0   0
1.3   0   0   0   0   1   0   0   0   0
1.4   0   0   0   1   0   1   1   1   1
1.5   0   1   0   0   0   0   1   0   0
1.6   0   0   1   0   0   0   1   0   0 
1.7   0   1   0   0   0   0   0   1   0 
1.8   0   0   1   0   0   0   1   0   0
1.9   0   1   0   0   0   0   1   0   1"""

# sep=r"\s+" splits on any whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv(StringIO(text), index_col=0, sep=r"\s+")

Bonus solution with numpy

Assuming same setup (this is probably quicker)

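a = df.values                              # underlying numpy array
n = a.sum(0).argsort()[-3:][::-1]          # positions of the 3 largest column sums
m = (a[:, n] == 1).any(1)                  # rows with at least one 1 in those columns

df.iloc[m, n]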

Notice that the columns are not the same as in my other solution. That is because several columns sum to the same value, and the two approaches break those ties differently.
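
A minimal sketch of the tie behaviour, using the column sums of the example matrix typed in by hand. I pass `kind="stable"` to `argsort` here so the numpy result is reproducible; the original snippet uses the default sort, whose tie order is not guaranteed.

```python
import numpy as np
import pandas as pd

# Column sums of the example matrix (number of 1's per column).
counts = pd.Series([1, 5, 2, 1, 1, 2, 5, 2, 2], index=list("ABCDEFGHI"))

# pandas: among tied values, the earlier column wins (keep="first" is the default).
first = list(counts.nlargest(3).index)

# keep="all" returns every column tied with the 3rd largest value.
all_ties = list(counts.nlargest(3, keep="all").index)

# numpy: take the last 3 positions of an ascending sort, then reverse them,
# so among tied values the later column wins.
order = np.argsort(counts.values, kind="stable")[-3:][::-1]
numpy_pick = list(counts.index[order])

print(first, all_ties, numpy_pick)
```

If the choice among tied columns matters for your data, `keep="all"` at least makes the tie visible instead of silently picking one.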


Timing
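
To compare the two approaches on your own machine, something like the following could be used. It is a sketch under my own assumptions: the random 0/1 matrix, its size, and the function names `pandas_way` and `numpy_way` are made up for illustration, and the numbers will vary with hardware and library versions.

```python
import timeit
import numpy as np
import pandas as pd

# Made-up 0/1 matrix standing in for the real data.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(0, 2, size=(2000, 200)))

def pandas_way(df, k=3):
    cols = df.sum().nlargest(k).index
    sub = df.loc[:, cols]
    return sub[sub.eq(1).any(axis=1)]

def numpy_way(df, k=3):
    a = df.values
    n = a.sum(0).argsort()[-k:][::-1]
    m = (a[:, n] == 1).any(1)
    return df.iloc[m, n]

# Total seconds for 50 calls of each approach.
t_pandas = timeit.timeit(lambda: pandas_way(big), number=50)
t_numpy = timeit.timeit(lambda: numpy_way(big), number=50)
print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s")
```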
