atonnerre - 3 years ago
Python Question

# Efficient way to create a symmetric matrix counting how often two strings belong to the same list

Suppose I have a dataframe that looks like the following:

```
import pandas as pd

df = pd.DataFrame(columns=["ID", "GROUP"])
df["ID"] = ["a1", "a2", "a3", "a4", "a5", "a6"]
df["GROUP"] = [["g1", "g3"], ["g2", "g3", "g5"], ["g3", "g5"], ["g2"], ["g1", "g5"], ["g3"]]
```

which gives:

```
df

   ID         GROUP
0  a1      [g1, g3]
1  a2  [g2, g3, g5]
2  a3      [g3, g5]
3  a4          [g2]
4  a5      [g1, g5]
5  a6          [g3]
```

and a list of groups as follows:

```
GROUPS = ["g1", "g2", "g3", "g4", "g5", "g6"]
```

Here is what I would like to obtain:

```
groups_df

    g1  g2  g3  g4  g5  g6
g1   2   0   1   0   1   0
g2   0   2   1   0   1   0
g3   1   1   4   0   2   0
g4   0   0   0   0   0   0
g5   1   1   2   0   3   0
g6   0   0   0   0   0   0
```

which counts the number of times two groups appear in the same list (or how many IDs are present in both groups).
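Put differently, if `M` is the 0/1 membership matrix with one row per ID and one column per group, the desired table is the Gram matrix `MᵀM`. A minimal pandas sketch of that identity, using the toy data above:

```python
import pandas as pd

GROUPS = ["g1", "g2", "g3", "g4", "g5", "g6"]
df = pd.DataFrame({
    "ID": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "GROUP": [["g1", "g3"], ["g2", "g3", "g5"], ["g3", "g5"],
              ["g2"], ["g1", "g5"], ["g3"]],
})

# 0/1 membership matrix M: one row per ID, one column per group
m = pd.DataFrame(
    [[int(g in row) for g in GROUPS] for row in df.GROUP],
    index=df.ID, columns=GROUPS,
)

# Gram matrix M^T M: entry (i, j) counts the IDs belonging to both groups
groups_df = m.T.dot(m)
# e.g. groups_df.loc["g3", "g5"] == 2 (IDs a2 and a3)
```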

My code looks something like this:

```
groups_df = pd.DataFrame(columns=GROUPS, index=GROUPS)

for group1 in GROUPS:
    for group2 in GROUPS:
        # count the rows whose GROUP list contains both groups
        groups_df.loc[group1, group2] = df[
            df.GROUP.map(lambda g: group1 in g and group2 in g)
        ].shape[0]
```

It works, but it is very slow on my actual data, which has about 200,000 rows in `df` and about 760 distinct groups in `GROUPS`, and I suspect my solution is not very clean.

The end goal is to use `groups_df` with `NetworkX`.

Can you think of a better way to achieve this?

Thanks a lot for reading this and for any help!

EDIT 1:

Following a suggestion from @gboffi (https://stackoverflow.com/a/47477464/8115634), I ran the following:

```
data = np.array(df.GROUP)
items = GROUPS
sc = np.vectorize(list.__contains__)
t = sc(data[:, None], items)
groups_array = np.array([np.sum(t[t[:, i]], axis=0) for i in range(len(GROUPS))])
groups_df = pd.DataFrame(groups_array, columns=GROUPS, index=GROUPS)
```

It was dramatically faster on the actual data: only 33 seconds! Thank you very much for your help.

Still, I will gladly try other suggestions for comparison.

---

**Answer** (gboffi)

I have an indirect way of solving your problem, with the apparent benefit of doing (almost) `O(1)` work in the Python interpreter itself, because the heavy loops (which are necessary) are performed by NumPy.

Let's start with some fake data (no dataframes, just `ndarray`s): a 10-row array of sets of random length, containing integers from 0 to 4 inclusive.

```
In : import numpy as np
In : import random
In : items = np.arange(5)
In : items
Out: array([0, 1, 2, 3, 4])
In : data = np.array([set(np.random.choice(items, random.randint(1, 5), False)) for _ in range(10)], dtype=set)
In : data
Out: array([{0, 2, 3}, {0, 1}, {2, 4}, {3, 4}, {3, 4}, {3, 4}, {0, 1, 2, 3, 4}, {3}, {2, 3, 4}, {1}], dtype=object)
```

Next, I transform this rather compact data into a boolean array

```
In : sc = np.vectorize(set.__contains__)
In : t = sc(data[:, None], items)
In : t
Out:
array([[ True, False,  True,  True, False],
       [ True,  True, False, False, False],
       [False, False,  True, False,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [ True,  True,  True,  True,  True],
       [False, False, False,  True, False],
       [False, False,  True,  True,  True],
       [False,  True, False, False, False]], dtype=bool)
```

I understand that, if your data is sparse, this dense boolean table can be a significant additional memory requirement, but it simplifies the next step.

```
In : np.array([np.sum(t[t[:, i]], axis=0) for i in items])
Out:
array([[3, 2, 2, 2, 1],
       [2, 3, 1, 1, 1],
       [2, 1, 4, 3, 3],
       [2, 1, 3, 7, 5],
       [1, 1, 3, 5, 6]])
```

Here, for each item `i`, we sum the columns of `t` (each corresponding to an item) over only the rows where item `i` is present.
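As an aside on the memory remark above: if a dense boolean table is too large, the same counts can be computed from a sparse representation. A sketch (assuming SciPy is available; the toy data here is my own, not the answer's):

```python
import numpy as np
from scipy import sparse

# toy membership data: entry i is the set of items present in row i
data = [{0, 2}, {1, 2}, {0, 1, 2}]
n_items = 3

# one (row, item) coordinate pair per membership
rows = [i for i, s in enumerate(data) for _ in s]
cols = [item for s in data for item in s]
t = sparse.coo_matrix(
    (np.ones(len(rows), dtype=np.int64), (rows, cols)),
    shape=(len(data), n_items),
).tocsr()

# co-occurrence counts: entry (i, j) = number of rows containing both i and j
cooc = (t.T @ t).toarray()
```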

Two remarks:

1. I think this should be faster than two explicit loops in Python, but I don't have a benchmark, at least for now...

2. I have tried to vectorize the last loop using broadcasting, but to no avail; if someone, starting from my answer, removes the last loop, I'll gladly upvote their answer.
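For what it's worth, that last loop can in fact be collapsed into a single matrix product: `np.sum(t[t[:, i]], axis=0)` is exactly row `i` of `tᵀt`. A minimal sketch, using my own toy boolean table rather than the answer's data:

```python
import numpy as np

# toy boolean membership table, same shape of data as `t` in the answer:
# rows are records, columns are items
t = np.array([[ True, False,  True],
              [ True,  True, False],
              [False,  True,  True],
              [ True,  True,  True]])

# the per-column loop from the answer
loop = np.array([np.sum(t[t[:, i]], axis=0) for i in range(t.shape[1])])

# loop-free equivalent: summing the rows where column i is True is a dot
# product, so the whole result is just t^T t
vectorized = t.astype(int).T @ t.astype(int)

assert (loop == vectorized).all()
```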
