Andrew Andrew - 3 days ago 5
R Question

Find matches of a vector of strings in another vector of strings

I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases.

# Sample data frame of articles
articles <- data.frame(id=c(1, 2, 3, 4), text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod", "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,", "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo", "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"))
articles$text <- as.character(articles$text)

# Sample vector of keywords or phrases
keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))

# id text
# 1 1 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# 2 2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
# 3 3 quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
# 4 4 consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse


Given the vector of keywords, the subset should contain rows 1, 2, and 4, since those rows contain one or more of the elements of the vector.

Neither
%in
nor
grepl()
work, since
%in%
seems to require that each word in the data frame be vectorized (
articles$text %in% keywords
results in four
FALSE
s), and
grep()
doesn't seem to be able to handle vectorized patterns (
grep(keywords, articles$text)
gives an error). Neither function alone seems to work well across multiple dimensions (i.e. it would be easy to search for one word in all the rows, but not all 3 at the same time).

What's the best way to find and select all rows of the data frame that contain at least one of the elements of the keyword vector?

Answer

You can try pasting your "keywords" together and separate them with the pipe character (|) which will work like an "or", like this:

> articles[grepl(paste(keywords, collapse="|"), articles$text),]
  id                                                                         text
1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
Comments