not_a_robot - 2 months ago
Python Question

Check if string in one column contained by another?

I am running a nested cluster analysis on some text data, using topic modeling approaches. In these approaches, overlap between "themes" (or topics) is allowed, to a certain extent. As such, I have some identifiers that have been grouped into multiple topics, while others have only been sent to one topic.

If you look below, you'll notice that several values for level_3_id have been "misplaced," so to speak (i.e. the value for level_3_id in that row indicates a different "level 2" theme than does the value for level_2_id). I'm trying to drop all rows in which the string in level_3_id does not contain the string in level_2_id, but am having some difficulties. See the code below.

In [25]: df
       identifier  level_1_id  level_2_id  level_3_id
15293     1270599     level_1    level_12   level_111
15294     1270599     level_1    level_12   level_122
15295     1270599     level_1    level_11   level_111
15296     1270599     level_1    level_11   level_122
15704     1269594     level_1    level_11   level_143
15705     1269594     level_1    level_11   level_111
15706     1269594     level_1    level_14   level_143
15707     1269594     level_1    level_14   level_111
29843     1271580     level_1    level_10   level_100

I'm trying to do something to this effect (I know it doesn't work, but I hope the general idea of vectorizing this string-containment checking process is conveyed):

import pandas as pd

pairs = list(zip(df['level_2_id'], df['level_3_id']))
new_data = [y for x, y in pairs if x in y]

This basically gives me what I want, but is there a way to do this in pandas which avoids having to assign a shorter list to a longer DataFrame?

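Something like this, maybe?

# Does not work: Series.apply passes each string (not a row) to the lambda,
# so x['level_3_id'] raises a TypeError.
df['level_3_id'].apply(lambda x: x['level_3_id'] in x['level_2_id'].tolist())

A correct answer, after @Noobie's help below:

# Keep only rows whose level_3_id string contains the level_2_id string.
df[df.apply(lambda row: row['level_2_id'] in row['level_3_id'], axis=1)]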
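
For anyone who wants to try this out, here is a minimal self-contained sketch of the row filter discussed in this question. It rebuilds the sample frame from the output shown earlier and compares the row-wise apply with a plain boolean list built by zipping the two columns. The names mask, by_apply and by_mask are just illustrative, and this is one possible way to do it rather than the only one.

```python
import pandas as pd

# Sample rows copied from the question's output (index values preserved).
df = pd.DataFrame(
    {
        "identifier": [1270599] * 4 + [1269594] * 4 + [1271580],
        "level_1_id": ["level_1"] * 9,
        "level_2_id": ["level_12", "level_12", "level_11", "level_11",
                       "level_11", "level_11", "level_14", "level_14",
                       "level_10"],
        "level_3_id": ["level_111", "level_122", "level_111", "level_122",
                       "level_143", "level_111", "level_143", "level_111",
                       "level_100"],
    },
    index=[15293, 15294, 15295, 15296, 15704, 15705, 15706, 15707, 29843],
)

# Row-wise apply: True where level_3_id contains level_2_id as a substring.
by_apply = df[df.apply(lambda row: row["level_2_id"] in row["level_3_id"], axis=1)]

# Same test as a boolean list with one entry per row, so it lines up with df
# and nothing shorter than the frame ever has to be assigned back to it.
mask = [l2 in l3 for l2, l3 in zip(df["level_2_id"], df["level_3_id"])]
by_mask = df[mask]

print(by_mask)
```

Both versions keep the rows with index 15294, 15295, 15705, 15706 and 29843. The list-based mask avoids calling a Python function once per row through apply, which tends to matter on larger frames.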