atonnerre - 1 year ago 170

Python Question

Suppose I have a dataframe that looks like the following:

`df = pd.DataFrame(columns = ["ID", "GROUP"])`

df["ID"] = ["a1", "a2", "a3", "a4", "a5", "a6"]

df["GROUP"] = [ ["g1", "g3"], ["g2", "g3", "g5"], ["g3", "g5"], ["g2"] , ["g1", "g5"], ["g3"]]

which gives:

`df`

ID GROUP

0 a1 [g1, g3]

1 a2 [g2, g3, g5]

2 a3 [g3, g5]

3 a4 [g2]

4 a5 [g1, g5]

5 a6 [g3]

and a list of groups as follows:

`GROUPS = ["g1", "g2", "g3", "g4", "g5", "g6"]`

Here is what I would like to obtain:

`groups_df`

g1 g2 g3 g4 g5 g6

g1 2 0 1 0 1 0

g2 0 2 1 0 1 0

g3 1 1 4 0 2 0

g4 0 0 0 0 0 0

g5 1 1 2 0 3 0

g6 0 0 0 0 0 0

which counts the number of times two groups appear in the same list (or how many IDs are present in both groups).

My code looks something like this:

`groups_df = pd.DataFrame(columns = GROUPS, index = GROUPS)`

for group1 in GROUPS:

for group2 in GROUPS:

groups_df.loc[group1, group2] = df[(df.GROUP.map(set) & {group1}) & (df.GROUP.map(set) & {group2})].shape[0]

It works but it is very slow with my actual data which consists of about 200000 rows in

`df`

`GROUPS`

The end goal is to use

`groups_df`

`NetworkX`

Can you think of a better way to achieve this ?

Thanks a lot for reading this and for any help !

EDIT 1:

Following a suggestion from @gboffi (https://stackoverflow.com/a/47477464/8115634), I ran the following:

`data = numpy.array(df.GROUP)`

items = GROUPS

sc = np.vectorize(list.__contains__)

t = sc(data[:, None], items)

groups_array = np.array([np.sum(t[t[:,i]], axis=0) for i in range(len(GROUPS))])

groups_df = pd.DataFrame(test, columns = GROUPS, index = GROUPS)

It was incredibly faster on the actual data: only 33 seconds! Thank you very much for your help.

Still, I will gladly try other suggestions for comparison.

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

I have this indirect way of solving your problem, that has the apparent benefit
of being `O(1)`

in Python, because the other loops (that are necessary) are performed by Numpy

Let's start with some fake data (no dataframes, just `ndarray`

s) in terms of a 10 row array of sets, of random length, containing integers from 0 to 4 inclusive

```
In [82]: import numpy as np
In [83]: import random
In [84]: items = np.arange(5)
In [85]: items
Out[85]: array([0, 1, 2, 3, 4])
In [86]: data = np.array([set(np.random.choice(items, random.randint(1, 5), False)) for _ in range(10)], dtype=set)
In [87]: data
Out[87]: array([{0, 2, 3}, {0, 1}, {2, 4}, {3, 4}, {3, 4}, {3, 4}, {0, 1, 2, 3, 4}, {3}, {2, 3, 4}, {1}], dtype=object)
```

Next, I transform this rather compact data into a boolean array

```
In [88]: sc = np.vectorize(set.__contains__)
In [89]: t = sc(data[:, None], items)
In [90]: t
Out[90]:
array([[ True, False, True, True, False],
[ True, True, False, False, False],
[False, False, True, False, True],
[False, False, False, True, True],
[False, False, False, True, True],
[False, False, False, True, True],
[ True, True, True, True, True],
[False, False, False, True, False],
[False, False, True, True, True],
[False, True, False, False, False]], dtype=bool)
```

I understand that, if your data is sparse, this can be a significant additional memory requirement, but this simplifies the next step

```
In [91]: np.array([np.sum(t[t[:,i]], axis=0) for i in items])
Out[91]:
array([[3, 2, 2, 2, 1],
[2, 3, 1, 1, 1],
[2, 1, 4, 3, 3],
[2, 1, 3, 7, 5],
[1, 1, 3, 5, 6]])
```

Here we sum the columns of `t`

(corresponding to different items) choosing only the rows where an item is present.

Two remarks

I think this should be faster than two explicit loops in Python, but I have not a benchmark, at least for now...

I have tried to vectorize the last loop using broadcasting but to no avail, if someone, starting from my answer, is going to remove the last loop I'll be glad to upvote if they post an answer on their own.

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**