user7140275 - 6 days ago
Python Question

Sub-string match in word tokenizer

I have defined a function that returns the sentences containing a specified word from an Excel file that has a 'text' column.
With the help of @Julien Marrec, I redefined the function so that I can pass multiple words as an argument, like this:

searched_words = ['word1', 'word2', 'word3', ...]

df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                               if any(w.lower() in searched_words
                                      for w in word_tokenize(sent))])
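
For context, a minimal runnable version of this setup might look like the following (the DataFrame contents and the searched_words values here are made up for illustration):

import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # run once to fetch the tokenizer models

# stand-in for the 'text' column read from the Excel file
df = pd.DataFrame({'text': [
    "I like word1. This sentence matches nothing.",
    "Here comes word2. And another plain sentence."
]})
searched_words = ['word1', 'word2', 'word3']

result = df['text'].apply(
    lambda text: [sent for sent in sent_tokenize(text)
                  if any(w.lower() in searched_words
                         for w in word_tokenize(sent))])
# each row keeps only the sentences containing one of the words
print(result)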


But the problem is that the dataset is pretty huge (typically gigabytes) and unstructured. Can someone suggest how I can make the match work on substrings too, i.e. if a sentence contains 'xxxxxword1yyyyy', my function should return that sentence as well?

Answer

If you don't care about word boundaries, you can skip word tokenisation and just match with a regular expression.

However, this might give you a lot of matches that you didn't expect. For example, the search terms "tin" and "nation" will both match in the word "procrastination". If that is what you want, you can do the following:

import re
from nltk.tokenize import sent_tokenize

# one alternation of all the (escaped, lowercased) search terms
fsa = re.compile('|'.join(re.escape(w.lower()) for w in searched_words))
df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                               if fsa.search(sent.lower())])

The re.compile() call creates a regex pattern object consisting of a single alternation of all the search terms, with re.escape() ensuring that any regex metacharacters in them are treated literally. This lets you scan each sentence just once, looking for all of the searched words at the same time; lowercasing the sentence before searching keeps the match case-insensitive, like your original w.lower() check.
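
Conversely, if you later decide you do want whole-word matches after all, a small variant of the same idea (a sketch, reusing the same searched_words list) is to wrap the alternation in \b word-boundary anchors and let re.IGNORECASE handle the casing:

import re
from nltk.tokenize import sent_tokenize

# \b anchors the alternation at word boundaries, so 'tin' no longer
# matches inside 'procrastination'; IGNORECASE replaces manual lowering
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in searched_words) + r')\b',
    re.IGNORECASE)

df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                               if pattern.search(sent)])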