piRSquared - 1 year ago 96
Python Question

# sort each column of correlation independently and get index values

I want to know the ordering from highest correlation to lowest correlation for each column in my

`DataFrame`
relative to every other column.

### Example

``````np.random.seed([3,1415])

df = pd.DataFrame(np.random.rand(6, 4).round(2), columns=list('ABCD'))
``````

Looks like:

``````      A     B     C     D
0  0.44  0.41  0.46  0.47
1  0.46  0.02  0.85  0.82
2  0.78  0.76  0.93  0.83
3  0.88  0.93  0.72  0.12
4  0.15  0.20  0.44  0.10
5  0.28  0.61  0.09  0.84
``````

The correlation table looks like:

``````print df.corr()

A         B         C         D
A  1.000000  0.702557  0.689214  0.005000
B  0.702557  1.000000  0.038306 -0.113245
C  0.689214  0.038306  1.000000  0.074773
D  0.005000 -0.113245  0.074773  1.000000
``````

I want column
`B`
to be ordered
`['B', 'A', 'C', 'D']`
. A
`DataFrame`
should look like this?

``````   A  B  C  D
A  A  B  C  D
B  B  A  A  C
C  C  C  D  A
D  D  D  B  B
``````

Also, I plan on doing this on an
`DataFrames`
from 20-40 thousand columns wide, so performance is an issue.

Answer Source

I think this would work:

``````In [11]:
df.corr().apply(lambda x: x.sort_values(ascending=False).index)

Out[11]:
A  B  C  D
A  A  B  C  D
B  B  A  A  C
C  C  C  D  A
D  D  D  B  B
``````

So this applies `sort_values` on each column and returns the index

Not sure how this will scale on a 40k column df but the problem is that currently `sort_values` on a df can only sort by a specific column or a row df-wide

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download