user7140275 - 9 days ago
Python Question

Extracting sentences using pandas with specific words

I have an Excel file with a text column. All I need to do is extract, for each row, the sentences from the text column that contain specific words.

I have tried defining a function.

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text, word):
    sentences = sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]

################# Finding Context ##########################

str_df['context'] = str_df['text'].apply(sentence_finder, args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")


But can someone please help me if I have to find the sentences containing multiple specific words, like snakes, venomous, anaconda? A sentence should be kept if it contains at least one of these words. I am not able to work out how to do this with nltk.tokenize and multiple words.

Words to be searched:

words = ['snakes', 'venomous', 'anaconda']


Input Excel file:

text
1. Snakes are venomous. Anaconda is venomous.
2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous.
4. Python is dangerous too.


Desired Output:

A column called context is appended to the text column above. The context column should look like this:

1. [Snakes are venomous.] [Anaconda is venomous.]
2. [Anaconda lives in Amazon.] [It is venomous.]
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
4. NULL


Thanks in advance.

Answer

Here's how:

In [1]: searched_words = ['snakes', 'venomous', 'anaconda']

In [2]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
   ...:                                if any(w.lower() in searched_words
   ...:                                       for w in word_tokenize(sent))])
Out[2]:
0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object

You can see that there are a couple of issues: sent_tokenize didn't do its job properly because some sentences run together with no space after the punctuation (e.g. "Amazon.Amazon").
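
One way around that is to normalize the glued-together punctuation before tokenizing. The following is only a rough sketch, not a tested drop-in solution: the regex, the reuse of the sentence_finder name and the file paths from the question are assumptions. Rows with no matching sentence return None, which ends up as an empty (NULL-like) cell in the output Excel file.

import re
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

searched_words = ['snakes', 'venomous', 'anaconda']

def sentence_finder(text, words):
    # Assumption: insert a space after sentence-ending punctuation that is
    # glued to the next word ("Amazon.Amazon" -> "Amazon. Amazon") so that
    # sent_tokenize can split the sentences correctly.
    text = re.sub(r'([.!?])([A-Za-z])', r'\1 \2', text)
    matching = [sent for sent in sent_tokenize(text)
                if any(w.lower() in words for w in word_tokenize(sent))]
    return matching or None  # None is written out as an empty cell

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")
str_df['context'] = str_df['text'].apply(sentence_finder, args=(searched_words,))
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

Note that after this normalization the third row comes out as separate sentences ("The least I expect is an anaconda." and "Because it is venomous.") rather than the single glued-together string shown in the desired output.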