Thomas Matthew Thomas Matthew - 2 months ago 24
Python Question

Congruency Table in Pandas (Pearson Correlation between each row for every row pair)

With the pandas dataframe below, taken from a dict of dict:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

NaN = np.nan

dd ={'A': {'A': '1', 'B': '2', 'C': '3'},
'B': {'A': '4', 'B': '5', 'C': '6'},
'C': {'A': '7', 'B': '8', 'C': '9'}}

df_link_link = pd.DataFrame.from_dict(dd, orient='index')

I'd like to form a new pandas DataFrame with the results of a Pearson correlation between rows for every row, excluding Pearson correlations between the same rows (correlating A with itself should just be
. This is spelled out as a dict of dicts here:

dict_congruent = {'A': {'A': NaN,
'B': pearsonr([NaN,2,3],[4,5,6]),
'C': pearsonr([NaN,2,3],[7,8,9])},
'B': {'A': pearsonr([4,NaN,6],[1,2,3]),
'B': NaN,
'C': pearsonr([4,NaN,6],[7,8,9])},
'C': {'A': pearsonr([7,8,NaN],[1,2,3]),
'B': pearsonr([7,8,NaN],[4,5,6]),
'C': NaN }}

is just
. Is there a way to do this as an operation within pandas without iterating through a dict of dicts? I have ~76million pairs, so a non-iterative approach would be great, if one exists.


Canonical but not viable solution

df.corr().mask(np.equal.outer(df.index.values, df.columns.values))

default method for corr is pearson.

enter image description here

Transpose Your Data To Use This
Wrapped up with a bow

very wide data

m, n = 1000, 10000
df = pd.DataFrame(np.random.randn(m, n), columns=['s{}'.format(i) for i in range(n)])

magic function

def corr(df, step=100, mask_diagonal=False):

    def corr_closure(df):
        d = df.values
        sums = d.sum(0, keepdims=True)
        stds = d.std(0, keepdims=True)
        n = d.shape[0]

        def corr_(k=0, l=10):
            d2 =[:, k:l])
            sums2 =[:, k:l])
            stds2 =[:, k:l])

            return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                                df.columns, df.columns[k:l])

        return corr_

    c = corr_closure(df)

    step = min(step, df.shape[1])

    tups = zip(range(0, n, step), range(step, n + step, step))

    corr_table = pd.concat([c(*t) for t in tups], axis=1)

    if mask_diagonal:
        np.fill_diagonal(corr_table.values, np.nan)

    return corr_table


ct = corr(df, mask_diagonal=True)
ct.iloc[:10, :10]

enter image description here

Magic Solution Explained

  • use a closure to pre-calculate column sums and standard deviations
  • return a function that takes positions of columns over which to correlate

def corr_closure(df):
    d = df.values  # get underlying numpy array
    sums = d.sum(0, keepdims=True)  # pre calculate sums
    stds = d.std(0, keepdims=True)  # pre calculate standard deviations
    n = d.shape[0]  # grab number of rows

    def corr(k=0, l=10):
        d2 =[:, k:l])  # for this slice, run dot product
        sums2 =[:, k:l])  # dot pre computed sums with slice
        stds2 =[:, k:l])  # dot pre computed stds with slice

        # calculate correlations with the information I have
        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,
                            df.columns, df.columns[k:l])

    return corr

10 columns
enter image description here

100 columns
enter image description here

1000 columns
enter image description here

10000 columns
df.corr() did not finish in a reasonable amount of time
enter image description here