O.rka O.rka - 3 months ago 29
Python Question

Most efficient way to turn dictionary into symmetric/distance matrix (Python | Pandas)

I'm doing pairwise distance for something w/ a weird distance metric. I have a dictionary like

{(key_A, key_B):distance_value}
and I want to make a symmetric
pd.DataFrame
like a distance matrix.

What is the most efficient way to do this? I found one way but it doesn't seem like the best way to do this... Is there anything in
NumPy
or
Pandas
that does this type of operation?
or just a quicker way? My way is
1.46 ms per loop


np.random.seed(0)
D_pair_value = dict()
for pair in itertools.combinations(list("ABCD"),2):
D_pair_value[pair] = np.random.randint(0,5)
D_pair_value
# {('A', 'B'): 4,
# ('A', 'C'): 0,
# ('A', 'D'): 3,
# ('B', 'C'): 3,
# ('B', 'D'): 3,
# ('C', 'D'): 1}
D_nested_dict = defaultdict(dict)
for (p,q), value in D_pair_value.items():
D_nested_dict[p][q] = value
D_nested_dict[q][p] = value

# Fill diagonal with zeros
DF = pd.DataFrame(D_nested_dict)
np.fill_diagonal(DF.values, 0)
DF


enter image description here

Answer

You can use scipy.spatial.distance.squareform, which converts a vector of distance computations, i.e. [d(A,B), d(A,C), ..., d(C,D)], into the distance matrix you're looking for.

Method 1: Distances Stored in a List

If you're computing your distances in order, like in your example code and in my example distance vector, I'd avoid using a dictionary and just store the results in a list, and do something like:

from scipy.spatial.distance import squareform

df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))

Method 2: Distances Stored in a Dictionary

If you're computing things out of order and a dictionary is required, you just need to get a distance vector that's properly sorted:

from scipy.spatial.distance import squareform

dist_list = [dist[1] for dist in sorted(D_pair_value.items())]
df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))

Method 3: Distances Stored in a Sorted Dictionary

If a dictionary is required, note that there's a package called sortedcontainers which has a SortedDict that essentially would solve the sorting issue for you. To use it, all you'd need to change is initializing D_pair_value as a SortedDict() instead of a dict. Using your example setup:

from scipy.spatial.distance import squareform
from sortedcontainers import SortedDict

np.random.seed(0)
D_pair_value = SortedDict()
for pair in itertools.combinations(list("ABCD"),2):
    D_pair_value[pair] = np.random.randint(0,5)

df = pd.DataFrame(squareform(D_pair_value.values()), index=list('ABCD'), columns=list('ABCD'))

The Resulting Output for Any Method Above:

     A    B    C    D
A  0.0  4.0  0.0  3.0
B  4.0  0.0  3.0  3.0
C  0.0  3.0  0.0  1.0
D  3.0  3.0  1.0  0.0
Comments