O.rka - 8 months ago 70

Python Question

I'm doing pairwise distance for something w/ a weird distance metric. I have a dictionary like

`{(key_A, key_B):distance_value}`

`pd.DataFrame`

What is the most efficient way to do this? I found one way but it doesn't seem like the best way to do this...

`NumPy`

`Pandas`

`1.46 ms per loop`

`np.random.seed(0)`

D_pair_value = dict()

for pair in itertools.combinations(list("ABCD"),2):

D_pair_value[pair] = np.random.randint(0,5)

D_pair_value

# {('A', 'B'): 4,

# ('A', 'C'): 0,

# ('A', 'D'): 3,

# ('B', 'C'): 3,

# ('B', 'D'): 3,

# ('C', 'D'): 1}

D_nested_dict = defaultdict(dict)

for (p,q), value in D_pair_value.items():

D_nested_dict[p][q] = value

D_nested_dict[q][p] = value

# Fill diagonal with zeros

DF = pd.DataFrame(D_nested_dict)

np.fill_diagonal(DF.values, 0)

DF

Answer

You can use `scipy.spatial.distance.squareform`

, which converts a vector of distance computations, i.e. `[d(A,B), d(A,C), ..., d(C,D)]`

, into the distance matrix you're looking for.

**Method 1: Distances Stored in a List**

If you're computing your distances in order, like in your example code and in my example distance vector, I'd avoid using a dictionary and just store the results in a list, and do something like:

```
from scipy.spatial.distance import squareform
df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))
```

**Method 2: Distances Stored in a Dictionary**

If you're computing things out of order and a dictionary is required, you just need to get a distance vector that's properly sorted:

```
from scipy.spatial.distance import squareform
dist_list = [dist[1] for dist in sorted(D_pair_value.items())]
df = pd.DataFrame(squareform(dist_list), index=list('ABCD'), columns=list('ABCD'))
```

**Method 3: Distances Stored in a Sorted Dictionary**

If a dictionary is required, note that there's a package called `sortedcontainers`

which has a `SortedDict`

that essentially would solve the sorting issue for you. To use it, all you'd need to change is initializing `D_pair_value`

as a `SortedDict()`

instead of a `dict`

. Using your example setup:

```
from scipy.spatial.distance import squareform
from sortedcontainers import SortedDict
np.random.seed(0)
D_pair_value = SortedDict()
for pair in itertools.combinations(list("ABCD"),2):
D_pair_value[pair] = np.random.randint(0,5)
df = pd.DataFrame(squareform(D_pair_value.values()), index=list('ABCD'), columns=list('ABCD'))
```

**The Resulting Output for Any Method Above:**

```
A B C D
A 0.0 4.0 0.0 3.0
B 4.0 0.0 3.0 3.0
C 0.0 3.0 0.0 1.0
D 3.0 3.0 1.0 0.0
```