not_a_robot not_a_robot - 24 days ago 10
Python Question

Check if string in one column contained by another?

I am running a nested cluster analysis on some text data, using topic modeling approaches. In these approaches, overlap between "themes" (or topics) is allowed, to a certain extent. As such, I have some identifiers that have been grouped into multiple topics, while others have only been sent to one topic.

If you look below, you'll notice that several values for level_3_id have been "misplaced," so to speak (i.e. the value for level_2_id for that row indicates a different "level 2" theme than does the value for level_3_id). I'm trying to drop all rows in which the string in df['level_3_id'] does not contain the string in df['level_2_id'], but am having some difficulties. See the code below.

In [25]: df
Out[25]:
       identifier level_1_id level_2_id level_3_id
15293     1270599    level_1   level_12  level_111
15294     1270599    level_1   level_12  level_122
15295     1270599    level_1   level_11  level_111
15296     1270599    level_1   level_11  level_122
15704     1269594    level_1   level_11  level_143
15705     1269594    level_1   level_11  level_111
15706     1269594    level_1   level_14  level_143
15707     1269594    level_1   level_14  level_111
29843     1271580    level_1   level_10  level_100


I'm trying to do something to this effect (I know it doesn't quite work, but I hope it conveys the general idea of vectorizing this string-containment check):

import pandas as pd

pairs = list(zip(df['level_2_id'], df['level_3_id']))
new_data = [y for x, y in pairs if x in y]


This basically gives me what I want, but is there a way to do this in pandas which avoids having to assign a shorter list to a longer df?

Something like this, maybe?

df['level_3_id'].str.contains(df['level_2_id'].str)


Edit:

A correct answer, after @Noobie's help below:

df[df.apply(lambda row: row['level_2_id'] in row['level_3_id'], axis=1)]
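For what it's worth, the same row-wise check can also be written as a boolean mask built from zip, which keeps the mask the same length as df and so sidesteps the "shorter list, longer df" problem. This is only a sketch: the sample rows are copied from the table above, and the names mask and filtered are just illustrative. A plain list comprehension like this is often faster than apply with axis=1, though I haven't timed it on this data.

```python
import pandas as pd

# Sample rows copied from the DataFrame shown in the question.
df = pd.DataFrame(
    {
        "identifier": [1270599] * 4 + [1269594] * 4 + [1271580],
        "level_1_id": ["level_1"] * 9,
        "level_2_id": ["level_12", "level_12", "level_11", "level_11",
                       "level_11", "level_11", "level_14", "level_14",
                       "level_10"],
        "level_3_id": ["level_111", "level_122", "level_111", "level_122",
                       "level_143", "level_111", "level_143", "level_111",
                       "level_100"],
    },
    index=[15293, 15294, 15295, 15296, 15704, 15705, 15706, 15707, 29843],
)

# One boolean per row: True when the level_2_id string is a substring
# of the level_3_id string in the same row.
mask = [l2 in l3 for l2, l3 in zip(df["level_2_id"], df["level_3_id"])]

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

On the sample above this should keep the same rows as the apply version (for example, the row with level_12 / level_122 stays and the row with level_12 / level_111 is dropped).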

Answer
df.apply(lambda x: x['level_2_id'] in x['level_3_id'], axis=1)