ML_Pro - 1 year ago
Python Question

How to extract all string matches from a column using an input corpus/list in pandas?

For example, I have the below list of strings as an input corpus (it is actually a big list with 100 values).

The data contains a column called action_description. How can I extract all the string matches in action_description using the action list as the input corpus?

Note: I have already lemmatized action_description, so if the column has words like jumping or jumped, they are already converted to jump.

Sample input & output

"I love to run and while my friend prefer to swim" --> "run swim"
"Allan excels at high jump but he is not a good at running" --> "jump run"

Note: I found the below pandas function, but it's not well documented, so I couldn't figure out how to use it.

Please recommend an optimal solution, since my input DataFrame has 200K rows.

Words like jumper & runway should be ignored by the algorithm, i.e. they should not be classified as jump & run.

Answer Source


  1. Split each string into words with str.split, then lemmatize only the verbs by passing pos='v' for each word, leaving the nouns as they were.
  2. Then take the words common to the lookup list and the lemmatized list using a set intersection.
  3. Finally, join them to return a string as the output.

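import pandas as pd
from nltk.stem.wordnet import WordNetLemmatizer

action = ['jump', 'fly', 'run', 'swim']     # lookup list
lem = WordNetLemmatizer()
# Lemmatize every token as a verb, then keep only the tokens that are in the lookup list.
fcn = lambda x: " ".join(set([lem.lemmatize(w, 'v') for w in x]).intersection(set(action)))
# Split on whitespace so fcn receives a list of words per row.
df['action_description'] = df['action_description'].str.split().apply(fcn)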
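Since the question says the column is already lemmatized, the lemmatization step could arguably be skipped. Below is a minimal sketch of that variant, assuming lowercase, whitespace-separated tokens. The helper name extract_actions is hypothetical and is not part of pandas or nltk. Matching whole tokens means jumper and runway are not mistaken for jump and run. Unlike joining a set, whose iteration order is arbitrary, this keeps the matches in order of first appearance.

```python
# Lookup list from the answer, stored as a frozenset for fast membership tests.
ACTIONS = frozenset(['jump', 'fly', 'run', 'swim'])

def extract_actions(text, actions=ACTIONS):
    """Return the action words found in text, in order of first appearance."""
    found = []
    for word in text.split():
        # Exact token comparison: 'jumper' or 'runway' never equals 'jump' or 'run'.
        if word in actions and word not in found:
            found.append(word)
    return " ".join(found)

print(extract_actions("I love to run and while my friend prefer to swim"))
```

With a DataFrame like the one in this answer, it could be applied as df['action_description'].apply(extract_actions).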


Starting DF used:

df = pd.DataFrame(dict(action_description=["I love to run and while my friend prefer to swim", 
                                           "Allan excels at high jump but he is not a good at running"]))

To generate binary flags (0/1), we can use the str.get_dummies method, splitting the strings on whitespace and computing their indicator variables as shown:

bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')
pd.concat([df['action_description'], bin_flag], axis=1)
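One thing to be aware of: str.get_dummies only creates columns for words that actually occur in the column, so an action that never appears (fly, in the sample data) gets no flag column. A possible way to get one column per entry of the lookup list is to reindex the result. This is a sketch under the assumption that the extraction step has already produced the two sample outputs, which are hard-coded here in the illustrative variable extracted:

```python
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']  # lookup list from the answer

# Assumed output of the extraction step for the two sample rows.
extracted = pd.Series(['run swim', 'jump run'], name='action_description')

# One indicator column per word that occurs somewhere in the column.
flags = extracted.str.get_dummies(sep=' ')

# Add all-zero columns for actions that never occur, in lookup-list order.
flags = flags.reindex(columns=action, fill_value=0).add_suffix('_flag')

result = pd.concat([extracted, flags], axis=1)
print(result)
```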

