Melsauce - 1 year ago 120
Python Question

# How to create an edge list from pandas dataframe?

I have a pandas dataframe (df) of the form-

``````
    Col1
A  [Green,Red,Purple]
B  [Red, Yellow, Blue]
C  [Brown, Green, Yellow, Blue]
``````

I need to convert this to an edge list i.e. a dataframe of the form:

``````
Source    Target    Weight
A         B         1
A         C         1
B         C         2
``````

EDIT
Note that the new dataframe has one row for every possible pair of rows in the original. The 'Weight' column is the size of the intersection of the two lists. For instance, B and C share two colors, Blue and Yellow, so the 'Weight' for that row is 2.

What is the fastest way to do this? The original dataframe contains about 28,000 elements.

First, starting off with the dataframe:

``````
In [823]: import pandas as pd
     ...: from itertools import combinations

In [824]: df = pd.DataFrame({'Col1': [['Green', 'Red', 'Purple'], ['Red', 'Yellow', 'Blue'], ['Brown', 'Green', 'Yellow', 'Blue']]}, index=['A', 'B', 'C'])

In [827]: df['Col1'] = df.Col1.apply(set)

In [828]: df
Out[828]:
                           Col1
A          {Purple, Red, Green}
B           {Red, Blue, Yellow}
C  {Green, Yellow, Blue, Brown}
``````

Each list in `Col1` has been converted into a set so that intersections can be computed efficiently. Next, we'll use `itertools.combinations` to create pairwise combinations of all rows in `df`:

``````
In [845]: df1 = pd.DataFrame(data=list(combinations(df.index.tolist(), 2)), columns=['Src', 'Dst'])

In [849]: df1
Out[849]:
  Src Dst
0   A   B
1   A   C
2   B   C
``````
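As a rough check on scale, the number of unordered pairs grows quadratically with the number of rows. The sketch below assumes the "28,000 elements" in the question means 28,000 rows; `n_rows` and `n_pairs` are names introduced here for illustration only.

```python
from math import comb  # available in Python 3.8+

n_rows = 28_000             # assumed row count, taken from the question
n_pairs = comb(n_rows, 2)   # n * (n - 1) / 2 unordered pairs

print(n_pairs)  # 391986000
```

At that size, materializing every pair with `list(combinations(...))` may be memory-hungry, so it could be worth testing on a subset first.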

Now, apply a function that takes the intersection of the two sets and returns its length. The `Src` and `Dst` columns act as a lookup into `df`.

``````
In [859]: df1['Weights'] = df1.apply(lambda x: len(df.loc[x['Src']]['Col1'].intersection(df.loc[x['Dst']]['Col1'])), axis=1)

In [860]: df1
Out[860]:
  Src Dst  Weights
0   A   B        1
1   A   C        1
2   B   C        2
``````

I advise doing the set conversion once, at the very beginning. Converting your lists to sets on the fly for every pair is expensive and wasteful.

For a further speedup, you'd probably also want to copy the sets into two columns of the new dataframe, as @Wen has done, because calling `df.loc` for every row inside `apply` will slow things down.
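One way that suggestion might look is sketched below. It reuses the same three-row toy data as above; the helper columns `SrcSet` and `DstSet` are hypothetical names chosen for illustration and are not taken from @Wen's answer.

```python
from itertools import combinations

import pandas as pd

# Same toy data as above, with each list converted to a set once up front.
df = pd.DataFrame(
    {'Col1': [['Green', 'Red', 'Purple'],
              ['Red', 'Yellow', 'Blue'],
              ['Brown', 'Green', 'Yellow', 'Blue']]},
    index=['A', 'B', 'C'],
)
df['Col1'] = df['Col1'].apply(set)

# All unordered pairs of index labels.
df1 = pd.DataFrame(list(combinations(df.index, 2)), columns=['Src', 'Dst'])

# Copy the sets next to each pair (SrcSet/DstSet are illustrative names),
# so that no df.loc lookup happens while computing the weights.
df1['SrcSet'] = df1['Src'].map(df['Col1'])
df1['DstSet'] = df1['Dst'].map(df['Col1'])

# Intersection size per pair, computed by iterating over the two columns.
df1['Weights'] = [len(s & d) for s, d in zip(df1['SrcSet'], df1['DstSet'])]

# Drop the helper columns once the weights are in place.
df1 = df1.drop(columns=['SrcSet', 'DstSet'])
print(df1)
```

Whether this is actually faster on the full data depends on the set sizes and on available memory, so it is worth timing both versions on a sample.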
