atonnerre atonnerre - 1 year ago 95
Python Question

Efficient way to create a symmetric matrix counting how often two strings belong to the same list

Suppose I have a dataframe that looks like the following:

df = pd.DataFrame(columns = ["ID", "GROUP"])
df["ID"] = ["a1", "a2", "a3", "a4", "a5", "a6"]
df["GROUP"] = [ ["g1", "g3"], ["g2", "g3", "g5"], ["g3", "g5"], ["g2"] , ["g1", "g5"], ["g3"]]


which gives:

df

ID GROUP
0 a1 [g1, g3]
1 a2 [g2, g3, g5]
2 a3 [g3, g5]
3 a4 [g2]
4 a5 [g1, g5]
5 a6 [g3]


and a list of groups as follows:

GROUPS = ["g1", "g2", "g3", "g4", "g5", "g6"]


Here is what I would like to obtain:

groups_df

g1 g2 g3 g4 g5 g6
g1 2 0 1 0 1 0
g2 0 2 1 0 1 0
g3 1 1 4 0 2 0
g4 0 0 0 0 0 0
g5 1 1 2 0 3 0
g6 0 0 0 0 0 0


which counts the number of times two groups appear in the same list (or how many IDs are present in both groups).

My code looks something like this:

groups_df = pd.DataFrame(columns = GROUPS, index = GROUPS)

for group1 in GROUPS:

for group2 in GROUPS:

groups_df.loc[group1, group2] = df[(df.GROUP.map(set) & {group1}) & (df.GROUP.map(set) & {group2})].shape[0]


It works but it is very slow with my actual data which consists of about 200000 rows in
df
and about 760 different groups in
GROUPS
, and I guess my solution is not very clean.

The end goal is to use
groups_df
with
NetworkX
.

Can you think of a better way to achieve this ?

Thanks a lot for reading this and for any help !

EDIT 1:

Following a suggestion from @gboffi (https://stackoverflow.com/a/47477464/8115634), I ran the following:

data = numpy.array(df.GROUP)
items = GROUPS
sc = np.vectorize(list.__contains__)
t = sc(data[:, None], items)
groups_array = np.array([np.sum(t[t[:,i]], axis=0) for i in range(len(GROUPS))])
groups_df = pd.DataFrame(test, columns = GROUPS, index = GROUPS)


It was incredibly faster on the actual data: only 33 seconds! Thank you very much for your help.

Still, I will gladly try other suggestions for comparison.

Answer Source

I have this indirect way of solving your problem, that has the apparent benefit of being O(1) in Python, because the other loops (that are necessary) are performed by Numpy

Let's start with some fake data (no dataframes, just ndarrays) in terms of a 10 row array of sets, of random length, containing integers from 0 to 4 inclusive

In [82]: import numpy as np
In [83]: import random
In [84]: items = np.arange(5)
In [85]: items
Out[85]: array([0, 1, 2, 3, 4])
In [86]: data = np.array([set(np.random.choice(items, random.randint(1, 5), False)) for _ in range(10)], dtype=set)
In [87]: data
Out[87]: array([{0, 2, 3}, {0, 1}, {2, 4}, {3, 4}, {3, 4}, {3, 4}, {0, 1, 2, 3, 4}, {3}, {2, 3, 4}, {1}], dtype=object)

Next, I transform this rather compact data into a boolean array

In [88]: sc = np.vectorize(set.__contains__)
In [89]: t = sc(data[:, None], items)
In [90]: t
Out[90]: 
array([[ True, False,  True,  True, False],
       [ True,  True, False, False, False],
       [False, False,  True, False,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False,  True,  True],
       [ True,  True,  True,  True,  True],
       [False, False, False,  True, False],
       [False, False,  True,  True,  True],
       [False,  True, False, False, False]], dtype=bool)

I understand that, if your data is sparse, this can be a significant additional memory requirement, but this simplifies the next step

In [91]: np.array([np.sum(t[t[:,i]], axis=0) for i in items])
Out[91]: 
array([[3, 2, 2, 2, 1],
       [2, 3, 1, 1, 1],
       [2, 1, 4, 3, 3],
       [2, 1, 3, 7, 5],
       [1, 1, 3, 5, 6]])

Here we sum the columns of t (corresponding to different items) choosing only the rows where an item is present.

Two remarks

  1. I think this should be faster than two explicit loops in Python, but I have not a benchmark, at least for now...

  2. I have tried to vectorize the last loop using broadcasting but to no avail, if someone, starting from my answer, is going to remove the last loop I'll be glad to upvote if they post an answer on their own.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download