balakishore nadella - 1 year ago
Python Question

Group rows if at least one word overlaps with another row in a dataframe column

I have a data frame as below

words group_id
0 set([a, c, b, d]) 1
1 set([a, b]) 2
2 set([h, e, g, f]) 3

I need to put rows in the same group whenever at least one word in a row's set(words) also appears in another row's set, and update group_id accordingly:

words group_id
0 set([a, c, b, d]) 1
1 set([a, b]) 1
2 set([h, e, g, f]) 3

This is what I tried:

from collections import Counter
import numpy as np

word_frequency = Counter()

for val in df['words'].values:
    word_frequency.update(val)  # loop body is missing in the post; presumably it counted each word

to_return = np.array(word_frequency.most_common())
count = 1

df['group_id'] = np.zeros(len(df)) * np.nan
for val in to_return:
    df['group_id'] = df[['group_id','words']].apply(lambda x: count if (val in x) else np.nan)
    count += 1

How can I do that?

Answer

This works, but it's fairly inefficient: it first builds the list of unique groups, then searches that list once for every row in the dataframe. It would be neat to see more efficient ways of doing this.

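def unique_grouper(series_of_entry_sets):
    # Start with a copy of the first row's set as the first group.
    set_of_groups = [set(series_of_entry_sets[0])]
    for potential_set in series_of_entry_sets:
        for accepted_set in set_of_groups:
            if accepted_set & potential_set:
                # Shared word found: merge this row's words into the group.
                accepted_set |= potential_set
                break
        else:
            # No existing group shares a word, so start a new group.
            set_of_groups.append(set(potential_set))
    return set_of_groups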

def group_identifier(current_set,set_of_groups):
    for i,unique_group in enumerate(set_of_groups):
        if current_set & unique_group:
            return i
    return None

import pandas as pd

df = pd.DataFrame({"Names": [set(["a", "c", "b", "d"]), set(["a", "b"]), set(["h", "e", "g", "f"]), set(["z"])]})
result = unique_grouper(df.Names)
df["group id"] = df.Names.apply(lambda x: group_identifier(x, result))


          Names  group id
0  {a, c, b, d}         0
1        {a, b}         0
2  {h, e, g, f}         1
3           {z}         2
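
On the efficiency point, one possible alternative is a union-find (disjoint-set) pass keyed on words. This is only a sketch, not part of the original answer: the function name `assign_overlap_groups` and its list-of-sets input are assumptions made for illustration. It touches each word once and also links rows that are connected only through a chain of other rows.

```python
def assign_overlap_groups(word_sets):
    """Return one group label per set (hypothetical helper, not from the answer).

    Sets that share a word, directly or through a chain of other sets,
    receive the same label. Labels are 0, 1, 2, ... in order of appearance.
    """
    parent = list(range(len(word_sets)))  # union-find forest over row positions

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_row_with_word = {}
    for row, words in enumerate(word_sets):
        for word in words:
            if word in first_row_with_word:
                # Link this row with the earliest row that contained the word.
                root_a = find(row)
                root_b = find(first_row_with_word[word])
                if root_a != root_b:
                    # Keep the smaller position as the root of the merged group.
                    parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_row_with_word[word] = row

    # Relabel the roots as consecutive integers.
    labels = {}
    return [labels.setdefault(find(row), len(labels)) for row in range(len(word_sets))]


example = [{"a", "c", "b", "d"}, {"a", "b"}, {"h", "e", "g", "f"}, {"z"}]
print(assign_overlap_groups(example))  # [0, 0, 1, 2]
```

With the dataframe above, something like `df["group id"] = assign_overlap_groups(df["Names"].tolist())` should give the same labels as the table shown.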