majesus majesus - 4 years ago 94
R Question

Remove terms from a text

I need to remove certain terms from a text:

texts
[1] "Lorem ipsum dolor sit amet"
[2] "consectetur adipiscing elit"


that fully match (i.e., whole term):

stopwords=read.csv("stopwords.txt", encoding = "UTF-8")

stopwords

[1] Lorem
[2] elit
[3] a


Expected result:

texts
[1] "ipsum dolor sit amet"
[2] "consectetur adipiscing"


I have tried removeWords but it does not work.

Thanks!
majesus

Answer

Do you mean removeWords from the tm package? It works for me:

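 library(tm)  # provides removeWords

 texts <- c("Lorem ipsum dolor sit amet", "consectetur adipiscing elit")
 stopwords <- c("Lorem", "elit", "a")

 # removeWords deletes whole-word matches; trimws strips the leftover outer spaces
 trimws(removeWords(texts, stopwords))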

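Output of removeWords alone, before trimws is applied (note the leftover leading and trailing spaces):

[1] " ipsum dolor sit amet"  
[2] "consectetur adipiscing "

The trimws call, taken from @rajnim's answer, strips those spaces.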
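
One possible reason removeWords appeared not to work in the question: read.csv returns a data frame (and by default treats the first line as a header), whereas removeWords expects a character vector of words. A minimal sketch of reading the list as a plain vector, assuming stopwords.txt holds one term per line; the temporary file below only stands in for that file:

```r
# Stand-in for stopwords.txt (assumed layout: one term per line)
path <- tempfile(fileext = ".txt")
writeLines(c("Lorem", "elit", "a"), path)

# readLines returns a plain character vector, one element per line
stopwords <- readLines(path, encoding = "UTF-8")

is.character(stopwords)  # TRUE
length(stopwords)        # 3
```

A vector read this way can be passed directly as the second argument of removeWords.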

Alternatively, using gsub from base R:

trimws(gsub(paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b"), "", texts))
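
When a removed term sits in the middle of a string, the gsub approach leaves two adjacent spaces behind, which trimws does not touch. The sketch below wraps the same pattern in a small function and collapses those runs as well. The name remove_terms is hypothetical, and the terms are assumed to contain no regex metacharacters:

```r
# Hypothetical helper (not part of tm): whole-word removal with base R only.
# Assumes the terms contain no regex metacharacters.
remove_terms <- function(texts, terms) {
  # \\b anchors the alternation to word boundaries, so "a" does not hit "amet"
  pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
  stripped <- gsub(pattern, "", texts)
  # Collapse the runs of whitespace left behind, then trim both ends
  trimws(gsub("\\s+", " ", stripped))
}

texts <- c("Lorem ipsum dolor sit amet", "consectetur elit adipiscing")
remove_terms(texts, c("Lorem", "elit", "a"))
```

Here "elit" is removed from the middle of the second string, and the result contains a single space between the remaining words.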