ML_Pro ML_Pro - 15 days ago 12
Python Question

How to extract all string matches from a column using an input corpus/list in pandas?

For example, I have the list of strings below as an input corpus (in practice it's a big list with 100 values).
action=['jump','fly','run','swim']

The data contains a column called action_description. How can I extract all the string matches in action_description using the action list as the input corpus?

Note: I have already lemmatized action_description, so if the column had words like jumping or jumped, they have already been converted to jump.

Sample input & output

"I love to run and while my friend prefer to swim" --> "run swim"
"Allan excels at high jump but he is not a good at running" --> "jump run"


Note: I found the pandas function below, but it's not well documented, so I couldn't figure out how to use it.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extractall.html

Please recommend an optimal solution, since my input DataFrame has 200K rows.

EDIT
Words like jumper & runway should be ignored by the algorithm, i.e. they should not be classified as jump & run.

Answer

Steps:

  1. Split each description into words with str.split, then lemmatize each word as a verb by supplying pos='v'; nouns are left as they were.
  2. Then take the words common to the lookup list and the lemmatized list using a set intersection.
  3. Finally, join them to return a string as the output.

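from nltk.stem.wordnet import WordNetLemmatizer   # needs the WordNet data: nltk.download('wordnet')

action = ['jump', 'fly', 'run', 'swim']     # lookup list
action_set = set(action)                    # build the set once, not once per row
lem = WordNetLemmatizer()

# lemmatize every token as a verb, then keep those present in the lookup list
fcn = lambda x: " ".join({lem.lemmatize(w, 'v') for w in x} & action_set)
df['action_description'] = df['action_description'].str.split().apply(fcn)
df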
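
One caveat: a set does not keep word order, so the matches are not guaranteed to come back as "run swim". A minimal sketch of an order-preserving variant of steps 2 and 3, assuming the tokens are already lemmatized (match_actions is a hypothetical helper name, not part of pandas or NLTK):

```python
action = ['jump', 'fly', 'run', 'swim']   # lookup list
action_set = set(action)                  # fast membership tests

def match_actions(tokens):
    """Return the lookup words found in tokens, in first-seen order, without repeats."""
    # dict.fromkeys keeps insertion order and drops repeated matches
    return " ".join(dict.fromkeys(w for w in tokens if w in action_set))

print(match_actions("I love to run and while my friend prefer to swim".split()))
print(match_actions("the jumper left the runway".split()))
```

Because whole tokens are compared, jumper and runway are ignored, as the EDIT in the question asks.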


Starting DF used:

import pandas as pd

df = pd.DataFrame(dict(action_description=["I love to run and while my friend prefer to swim", 
                                           "Allan excels at high jump but he is not a good at running"]))
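
Since the question says the column is already lemmatized, the lemmatizer can arguably be skipped. A minimal sketch of a regex alternative using Series.str.findall with \b word boundaries, so that jumper and runway are not matched. The sample frame and the names sample, pattern and matches are illustrative assumptions, and the second sentence is written in its lemmatized form:

```python
import re
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']   # lookup list

sample = pd.DataFrame({"action_description": [
    "I love to run and while my friend prefer to swim",
    "Allan excels at high jump but he is not a good at run",
    "the jumper left the runway",
]})

# \b anchors each alternative at word boundaries; re.escape guards special characters
pattern = r"\b(?:" + "|".join(map(re.escape, action)) + r")\b"

# findall returns a list of matches per row, in order of appearance (repeats are kept)
matches = sample["action_description"].str.findall(pattern).str.join(" ")
print(matches.tolist())
```

This avoids one lemmatizer call per word, which may help at 200K rows, though it is worth timing on the real data.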

To generate binary flags (0/1), we can use the str.get_dummies method, which splits the strings on whitespace and computes their indicator variables, as shown:

bin_flag = df['action_description'].str.get_dummies(sep=' ').add_suffix('_flag')
pd.concat([df['action_description'], bin_flag], axis=1)
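
One thing to note: get_dummies only creates columns for words that actually occur, so fly would get no flag column with the sample data. A sketch, assuming one flag column per lookup word is wanted, that reindexes against the action list (matched is a stand-in for the already-extracted column):

```python
import pandas as pd

action = ['jump', 'fly', 'run', 'swim']   # lookup list

# stand-in for the extracted matches; the empty string is a row with no match
matched = pd.Series(["run swim", "jump run", ""], name="action_description")

flags = (
    matched.str.get_dummies(sep=" ")           # one 0/1 column per word seen
    .reindex(columns=action, fill_value=0)     # add missing lookup words as all-zero columns
    .add_suffix("_flag")
)
result = pd.concat([matched, flags], axis=1)
print(result)
```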
