Shannon - 2 days ago
R Question

Remove certain words in string from column in dataframe in R

I have a dataset in R that lists a bunch of company names, and I want to remove words like "Inc", "Company", "LLC", etc. as part of a clean-up effort. I have the following sample data:

sampleData

      Location             Company
1 New York, NY         XYZ Company
2  Chicago, IL Consulting Firm LLC
3    Miami, FL         Smith & Co.


Words I do not want to include in my output:

stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")


I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(sampleData$Company,stopwords)


The output for the above function looks like this:

[1] "XYZ Company Consulting Firm Smith"


The output should be:

      Location         Company
1 New York, NY     XYZ Company
2  Chicago, IL Consulting Firm
3    Miami, FL           Smith


Any help would be appreciated.

Answer

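We can use the 'tm' package. Its removeWords() function works on a whole character vector at once, so each row is cleaned separately: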

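library(tm)

stopwords <- readLines('stopwords.txt')     # Your stop words file, one word per line
x <- sampleData$Company                     # Company column from the question's data
# Call tm::removeWords explicitly: a removeWords() defined in your own
# workspace (as in the question) would otherwise mask the package function
x <- tm::removeWords(x, stopwords)          # Remove stop words from every element

sampleData$Company_new <- x                 # Add the cleaned vector as a new column and check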
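
If you would rather stay in base R, here is a minimal sketch that keeps the logic of the function from the question. The original returns a single string because unlist(strsplit(...)) flattens the words of all rows into one vector before paste() collapses them. Applying the function to one company name at a time avoids that. The helper name removeWordsOne is hypothetical (not from the original post), and the sample data is rebuilt here from the printed table, so treat this as an illustration rather than the definitive fix:

```r
# Sample data rebuilt from the table in the question (an assumption about its exact form)
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)

stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# Hypothetical helper: same logic as the question's function, for ONE string
removeWordsOne <- function(str, stopwords) {
  x <- unlist(strsplit(str, " ", fixed = TRUE))   # split a single name into words
  paste(x[!x %in% stopwords], collapse = " ")     # drop stop words, rejoin the rest
}

# vapply() calls the helper once per row and returns a character vector
sampleData$Company <- vapply(sampleData$Company, removeWordsOne,
                             character(1), stopwords = stopwords,
                             USE.NAMES = FALSE)

sampleData
```

Because the match is on whole space-separated words, "Co." is removed only when "Co." itself is in the stop word list; a name such as "Cooper" would be left alone.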