Xiaomeng Xiaomeng - 2 months ago 17
Python Question

How to merge by a column of collection using Python Pandas?

I have 2 lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag. e.g:

|Id |Tag
| -------- | --------------------------------------------
|2 |c#,winforms,type-conversion,decimal,opacity


For each question in group A, I need to find in group B all matched questions that have at least one overlapping tag the question in group A, independent of the position of tags. For example, these questions should all be matched questions:

|Id |Tag
|----------|---------------------------
|3 |c#
|4 |winforms,type-conversion
|5 |winforms,c#


My first thought was to convert the variable Tag into a set variable and merge using Pandas because set ignores position. However, it seems that Pandas doesn't allow a set variable to be the key variable. So I am now using for loop to search over group B. But it is extremely slow since I have 13 million observation in group B.

My question is:
1. Is there any other way in Python to merge by a column of collection and can tell the number of overlapping tags?
2. How to improve the efficiency of for loop search?

Answer Source

This can be achieved using df.join and df.groupby.

This is the setup I'm working with:

df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]}) 

   Id                                                Tag
0   2  [c#, winforms, type-conversion, decimal, opacity]

df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})  

   Id                          Tag
0   3                         [c#]
1   4  [winforms, type-conversion]
2   5               [winforms, c#]

Let's flatten out the right column in both data frames. This helped:

In [2331]: from itertools import chain

In [2332]: def flatten(df):
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...: 

In [2333]: df1 = flatten(df1)

In [2334]: df2 = flatten(df2)

In [2335]: df1.head()
Out[2335]: 
   Id              Tag
0   2               c#
1   2         winforms
2   2  type-conversion
3   2          decimal
4   2          opacity

And similarly for df2, which is also flattened.

Now is the magic. We'll do a join on Tag column, and then groupby on joined IDs to find count of overlapping tags.

In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]: 
   Id_x  Id_y  Tag
0     2     3    1
1     2     4    2
2     2     5    2

The output shows each pair of tags along with the number of overlapping tags. Pairs with no overlaps are filtered out by the groupby.

The df.count counts overlapping tags, and df.reset_index just prettifies the output, since groupby assigns the grouped column as the index, so we reset it.

To see matching tags, you'll modify the above slightly:

In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]: 
   Id_x  Id_y                          Tag
0     2     3                         [c#]
1     2     4  [winforms, type-conversion]
2     2     5               [c#, winforms]

To filter out 1-overlaps, chain a df.query call to the first expression:

In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]: 
   Id_x  Id_y  Tag
1     2     4    2
2     2     5    2