lev lev - 3 months ago 45
Python Question

Python, pandas: how to extract values from a symmetric, multi-index dataframe

I have a symmetric, multi-index dataframe from which I want to systematically extract data:

import pandas as pd

df_index = pd.MultiIndex.from_arrays(
[["A", "A", "B", "B"], [1, 2, 3, 4]], names = ["group", "id"])
df = pd.DataFrame(
[[1.0, 0.5, 0.3, -0.4],
[0.5, 1.0, 0.9, -0.8],
[0.3, 0.9, 1.0, 0.1],
[-0.4, -0.8, 0.1, 1.0]],
index=df_index, columns=df_index)


I want a function
extract_vals
that can return all values related to elements in the same group, EXCEPT for the diagonal AND elements must not be double-counted. Here are two examples of the desired behavior (order does not matter):

A_vals = extract_vals("A", df) # [0.5, 0.3, -0.4, 0.9, -0.8]
B_vals = extract_vals("B", df) # [0.3, 0.9, 0.1, -0.4, -0.8]


My question is similar to this question on SO, but my situation is different because I am using a multi-index dataframe.

Finally, to make things more fun, please consider efficiency because I'll be running this many times on much bigger dataframes. Thanks very much!

EDIT:



Happy001's solution is awesome. I came up with a method myself based on the logic of extracting the elements where target is NOT in BOTH the rows and columns, and then extracting the lower triangle of those elements where target IS in BOTH the rows and columns. However, Happy001's solution is much faster.

First, I created a more complex dataframe to make sure both methods are generalizable:

import pandas as pd
import numpy as np

df_index = pd.MultiIndex.from_arrays(
[["A", "B", "A", "B", "C", "C"], [1, 2, 3, 4, 5, 6]], names=["group", "id"])
df = pd.DataFrame(
[[1.0, 0.5, 1.0, -0.4, 1.1, -0.6],
[0.5, 1.0, 1.2, -0.8, -0.9, 0.4],
[1.0, 1.2, 1.0, 0.1, 0.3, 1.3],
[-0.4, -0.8, 0.1, 1.0, 0.5, -0.2],
[1.1, -0.9, 0.3, 0.5, 1.0, 0.7],
[-0.6, 0.4, 1.3, -0.2, 0.7, 1.0]],
index=df_index, columns=df_index)


Next, I defined both versions of extract_vals (the first is my own):

def extract_vals(target, multi_index_level_name, df):
# Extract entries where target is in the rows but NOT also in the columns
target_in_rows_but_not_in_cols_vals = df.loc[
df.index.get_level_values(multi_index_level_name) == target,
df.columns.get_level_values(multi_index_level_name) != target]

# Extract entries where target is in the rows AND in the columns
target_in_rows_and_cols_df = df.loc[
df.index.get_level_values(multi_index_level_name) == target,
df.columns.get_level_values(multi_index_level_name) == target]
mask = np.triu(np.ones(target_in_rows_and_cols_df.shape), k = 1).astype(np.bool)
vals_with_nans = target_in_rows_and_cols_df.where(mask).values.flatten()
target_in_rows_and_cols_vals = vals_with_nans[~np.isnan(vals_with_nans)]

# Append both arrays of extracted values
vals = np.append(target_in_rows_but_not_in_cols_vals, target_in_rows_and_cols_vals)

return vals

def extract_vals2(target, multi_index_level_name, df):
# Get indices for what you want to extract and then extract all at once
coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i < j and (
df.index.get_level_values(multi_index_level_name)[i] == target or (
df.columns.get_level_values(multi_index_level_name)[j] == target))]

return df.values[tuple(np.transpose(coord))]


I checked that both functions returned output as desired:

# Expected values
e_A_vals = np.sort([0.5, 1.0, -0.4, 1.1, -0.6, 1.2, 0.1, 0.3, 1.3])
e_B_vals = np.sort([0.5, 1.2, -0.8, -0.9, 0.4, -0.4, 0.1, 0.5, -0.2])
e_C_vals = np.sort([1.1, -0.9, 0.3, 0.5, 0.7, -0.6, 0.4, 1.3, -0.2])

# Sort because order doesn't matter
assert np.allclose(np.sort(extract_vals("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals("C", "group", df)), e_C_vals)

assert np.allclose(np.sort(extract_vals2("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals2("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals2("C", "group", df)), e_C_vals)


And finally, I checked speed:

## Test speed
import time

# Method 1
start1 = time.time()
for ii in range(10000):
out = extract_vals("C", "group", df)
elapsed1 = time.time() - start1
print elapsed1 # 28.5 sec

# Method 2
start2 = time.time()
for ii in range(10000):
out2 = extract_vals2("C", "group", df)
elapsed2 = time.time() - start2
print elapsed2 # 10.9 sec

Answer

I don't assume df has the same columns and index. (Of course they can be the same).

def extract_vals(group_label, df):    
    coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i<j and (df.index.get_level_values('group')[i] == group_label or df.columns.get_level_values('group')[j] == group_label) ]
    return df.values[tuple(np.transpose(coord))]

print extract_vals('A', df)
print extract_vals('B', df)

result:

[ 0.5  0.3 -0.4  0.9 -0.8]
[ 0.3 -0.4  0.9 -0.8  0.1]
Comments