KarlTMuahaha T - 1 year ago
R Question

Extract terms, which are in a list, from text field in R

I have a question about string extraction.

I have a table below:

Text
Monday is windy and raining
Tuesday is sunny
Wednesday is snowing and cold


I also have a list of words:

windy
raining
sunny
snowing
cold


I want to extract the terms in the list from the table above, so that the result looks like this:

Text                           Terms1   Terms2   Terms3
Monday is windy and raining    windy    raining
Tuesday is sunny               sunny
Wednesday is snowing and cold  snowing  cold


Is there a way to do this in R?

Thank you

Answer Source

Given a vector of words:

> words
[1] "windy"   "raining" "sunny"   "snowing" "cold"   

and a data frame with a text column:

> data
                           text
1   Monday is windy and raining
2              Tuesday is sunny
3 Wednesday is snowing and cold

You can easily create a matrix of true/false values for each word's presence in the text:

> sapply(words, function(w){grepl(w,data$text)})
     windy raining sunny snowing  cold
[1,]  TRUE    TRUE FALSE   FALSE FALSE
[2,] FALSE   FALSE  TRUE   FALSE FALSE
[3,] FALSE   FALSE FALSE    TRUE  TRUE

You can add this onto your data frame if you want:

> cbind(data, sapply(words, function(w){grepl(w,data$text)}))
                           text windy raining sunny snowing  cold
1   Monday is windy and raining  TRUE    TRUE FALSE   FALSE FALSE
2              Tuesday is sunny FALSE   FALSE  TRUE   FALSE FALSE
3 Wednesday is snowing and cold FALSE   FALSE FALSE    TRUE  TRUE
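
One caveat with the calls above: grepl treats each word as a regular expression and matches it anywhere in the text, so "cold" would also be found inside "colder". If only whole words should count, a minimal sketch is to wrap each word in \b word boundaries. The `present` name and the "Thursday is colder" string are made up for illustration, and perl = TRUE is one way to get \b handled consistently:

```r
words <- c("windy", "raining", "sunny", "snowing", "cold")
data <- data.frame(
  text = c("Monday is windy and raining",
           "Tuesday is sunny",
           "Wednesday is snowing and cold"),
  stringsAsFactors = FALSE
)

# Same true/false matrix as above, but each word must match as a whole word.
present <- sapply(words, function(w) {
  grepl(paste0("\\b", w, "\\b"), data$text, perl = TRUE)
})

# Hypothetical extra sentence showing the difference:
plain_match <- grepl("cold", "Thursday is colder")                      # substring match
whole_match <- grepl("\\bcold\\b", "Thursday is colder", perl = TRUE)   # whole-word match
```

For the three example rows both approaches give the same matrix; they differ only when a listed word appears as part of a longer word.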

The output you've asked for is a ragged data structure, which would probably be better stored as a list. Unless you can clarify what you need, the code above should be enough for you to work this into whatever form you want with some basic R code. Look at the help for grep and friends (?grep) to see how to search for strings within text.
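
As a rough sketch of the list approach, and of padding that list out into the Terms1, Terms2, ... columns from the question: the names `found`, `n_max`, `terms` and `result` are invented here, and fixed = TRUE is an assumption that the words should be matched literally rather than as regular expressions.

```r
words <- c("windy", "raining", "sunny", "snowing", "cold")
data <- data.frame(
  text = c("Monday is windy and raining",
           "Tuesday is sunny",
           "Wednesday is snowing and cold"),
  stringsAsFactors = FALSE
)

# Ragged result: one character vector of matched words per row.
# Matches come back in the order of `words`, not the order in the text.
found <- lapply(data$text, function(txt) {
  words[vapply(words, grepl, logical(1), x = txt, fixed = TRUE)]
})

# Pad every vector with NA up to the longest one, then bind into a matrix.
n_max <- max(lengths(found))
terms <- do.call(rbind, lapply(found, function(f) f[seq_len(n_max)]))
colnames(terms) <- paste0("Terms", seq_len(n_max))

# Attach the padded columns to the original data frame.
result <- cbind(data, terms, stringsAsFactors = FALSE)
print(result)
```

The number of Terms columns is the largest number of matches in any row, so this example data yields Terms1 and Terms2 only, with NA where a row has fewer matches.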
