Venu Venu - 2 months ago

Question

Performance issue while matching a list of words against a list of sentences in R

I am trying to match a list of words against a list of sentences and build a data frame of the matching words and sentences. For example:

words <- c("far better", "good", "great", "sombre", "happy")
sentences <- c("This document is far better", "This is a great app", "The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")


The expected result (a dataframe) is as follows:

sentences                                     words
This document is far better                   better
This is a great app                           great
The night skies were sombre and starless      sombre
The app is too good and i am happy using it   good, happy
This is how it works                          -


I am using the following code to achieve this.

lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y > 0]$x
neg.words <- polarity_table[polarity_table$y < 0]$x
positiveWordsList <- list()
negativeWordsList <- list()

for (i in 1:lengthOfData) {
  sentence <- sentence_df[i, ]$comment
  #sentence <- gsub('[[:punct:]]', "", sentence)
  #sentence <- gsub('[[:cntrl:]]', "", sentence)
  #sentence <- gsub('\\d+', "", sentence)
  sentence <- tolower(sentence)

  # get unigrams from the sentence
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))

  # get bigrams from the sentence
  # (seq_len(length(unigrams) - 1) avoids the off-by-one of 1:length(unigrams)-1,
  # which R parses as (1:length(unigrams)) - 1 and so starts at index 0)
  bigrams <- unlist(lapply(seq_len(length(unigrams) - 1),
                           function(i) paste(unigrams[i], unigrams[i + 1])))

  # .. and combine into one vector of candidate words
  words <- c(unigrams, bigrams)
  #if(sentence_df[i,]$ave_sentiment)

  # look each candidate up in the positive/negative word lists
  pos.matches <- na.omit(match(words, pos.words))
  neg.matches <- na.omit(match(words, neg.words))
  positiveList <- pos.words[pos.matches]
  negativeList <- neg.words[neg.matches]

  if (length(positiveList) == 0) positiveList <- c("-")
  if (length(negativeList) == 0) negativeList <- c("-")

  positiveWordsList[i] <- toString(unique(positiveList))
  negativeWordsList[i] <- toString(unique(negativeList))
}

positiveWordsList <- as.vector(unlist(positiveWordsList))
negativeWordsList <- as.vector(unlist(negativeWordsList))
scores.df <- data.frame(ave_sentiment = sentence_df$ave_sentiment,
                        comment = sentence_df$comment,
                        pos = positiveWordsList,
                        neg = negativeWordsList,
                        year = sentence_df$year,
                        month = sentence_df$month,
                        stringsAsFactors = FALSE)
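The bigram step is easy to get wrong because of R's operator precedence, so here is a quick standalone check with a hypothetical toy input:

```r
unigrams <- c("this", "is", "a", "test")

# Note: 1:length(unigrams)-1 parses as (1:length(unigrams)) - 1, i.e. 0 1 2 3,
# so the first "bigram" would paste unigrams[0] (empty) with unigrams[1].
# seq_len(length(unigrams) - 1) gives the intended indices 1 2 3.
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1),
                         function(i) paste(unigrams[i], unigrams[i + 1])))
bigrams
# -> "this is" "is a" "a test"
```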


I have 28k sentences and 65k words to match against. The above code takes 45 seconds to accomplish the task. Any suggestions on how to improve the performance, as the current approach takes a lot of time?

Edit:
I missed the following when I posted the question.
I want to get only those words which exactly match a word in the sentence. For example:

words <- c('sin', 'vice', 'crashes')
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP')


Now, for the above case, my output should be as follows:

sentences                                                                   words
Since the app crashes frequently, I advice you guys to fix the issue ASAP   crashes

Answer

I was able to use @David Arenburg's answer with some modifications. Here is what I did. I used the following (suggested by David) to form the data frame:

library(stringi)  # for stri_detect_fixed

df <- data.frame(sentences)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))
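Run on the sample data from the question, those two lines behave like this (a standalone sketch; note that a sentence with no match comes out as an empty string rather than "-"):

```r
library(stringi)

words <- c("far better", "good", "great", "sombre", "happy")
sentences <- c("This document is far better",
               "This is a great app",
               "The night skies were sombre and starless",
               "The app is too good and i am happy using it",
               "This is how it works")

df <- data.frame(sentences, stringsAsFactors = FALSE)
# stri_detect_fixed(x, words) checks each word as a plain substring of x
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))

unname(df$words)
# -> "far better" "great" "sombre" "good, happy" ""
```

Because this is plain substring matching, it is fast but over-matches: with the edit's word list, "sin" would be flagged inside "since" and "vice" inside "advice", which is why the extra filtering step is needed.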

The problem with the above approach is that it does not do an exact word match: stri_detect_fixed looks for substrings, so "sin" also matches inside "Since". So I used the following to filter out the words that did not exactly match a word in the sentence.

# here `s` is the per-sentence list of matched words
# (presumably something like strsplit(df$words, ", ") on the data frame built above)
df <- data.frame(fil = unlist(s), text = rep(df$sentence, sapply(s, FUN = length)))

After applying the above line the output data frame changes as follows.

sentences                                                                   words
This document is far better                                                 better
This is a great app                                                         great
The night skies were sombre and starless                                    sombre
The app is too good and i am happy using it                                 good
The app is too good and i am happy using it                                 happy
This is how it works                                                        -
Since the app crashes frequently, I advice you guys to fix the issue ASAP   crashes
Since the app crashes frequently, I advice you guys to fix the issue ASAP   vice
Since the app crashes frequently, I advice you guys to fix the issue ASAP   sin

Now apply the following filter to the data frame to remove the words that are not an exact match for a word in the sentence.

df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
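As a self-contained illustration of that filter, here is a sketch with the mismatched rows from the "crashes" example hard-coded:

```r
df <- data.frame(
  fil  = c("crashes", "vice", "sin"),
  text = rep("Since the app crashes frequently, I advice you guys to fix the issue ASAP", 3),
  stringsAsFactors = FALSE
)

# keep a row only when the candidate word occurs as a whole whitespace-delimited token
df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split = "\\s+")))), ]

df$fil
# -> "crashes"
```

One caveat: punctuation stays attached to the tokens (e.g. "frequently,"), so a word followed by a comma or period would not survive the filter; stripping punctuation before splitting would make it more robust.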

Now my resulting data frame is as follows.

    sentences                                                                   words
    This document is far better                                                 better
    This is a great app                                                         great
    The night skies were sombre and starless                                    sombre
    The app is too good and i am happy using it                                 good
    The app is too good and i am happy using it                                 happy
    This is how it works                                                        -
    Since the app crashes frequently, I advice you guys to fix the issue ASAP   crashes

stri_detect_fixed reduced my computation time a lot, and the remaining steps did not take much time. Thanks to @David for pointing me in the right direction.