piRSquared piRSquared - 6 months ago 30
Python Question

sort each column of correlation independently and get index values

I want to know the ordering from highest correlation to lowest correlation for each column in my

DataFrame
relative to every other column.

Example



np.random.seed([3,1415])

df = pd.DataFrame(np.random.rand(6, 4).round(2), columns=list('ABCD'))


Looks like:

A B C D
0 0.44 0.41 0.46 0.47
1 0.46 0.02 0.85 0.82
2 0.78 0.76 0.93 0.83
3 0.88 0.93 0.72 0.12
4 0.15 0.20 0.44 0.10
5 0.28 0.61 0.09 0.84


The correlation table looks like:

print df.corr()

A B C D
A 1.000000 0.702557 0.689214 0.005000
B 0.702557 1.000000 0.038306 -0.113245
C 0.689214 0.038306 1.000000 0.074773
D 0.005000 -0.113245 0.074773 1.000000


I want column
B
to be ordered
['B', 'A', 'C', 'D']
. A
DataFrame
should look like this?

A B C D
A A B C D
B B A A C
C C C D A
D D D B B


Also, I plan on doing this on an
DataFrames
from 20-40 thousand columns wide, so performance is an issue.

Answer

I think this would work:

In [11]:
df.corr().apply(lambda x: x.sort_values(ascending=False).index)

Out[11]:
   A  B  C  D
A  A  B  C  D
B  B  A  A  C
C  C  C  D  A
D  D  D  B  B

So this applies sort_values on each column and returns the index

Not sure how this will scale on a 40k column df but the problem is that currently sort_values on a df can only sort by a specific column or a row df-wide