RCN - 2 months ago
R Question

Match multiple items in list to string in R

I am struggling to detect the items of a list within a separate string element.
In the following data frame:

original_df <- structure(list(title = c("Film Review: Almost Christmas", "Film Review: Mascots",
"Women s Basketball Upstages No. 2 California Baptist", "Men s Basketball Goes 2-0 In Opening Home Matchups",
"Women s Soccer Wins 16th Consecutive Game, Moves Onto Third Round of Tournament",
"The Hype About Hullabaloo"), tags = c("[u'Arts & Entertainment', u'Films & TV', u'Trending', u'Almost Christmas', u'Danny Glover', u'David E. Talbert', u'family', u'Film', u'Gabrielle Union', u'Holiday', u'JB Smoove', u'movie', u'review']",
"[u'Arts & Entertainment', u'Films & TV', u'Homepage', u'Trending', u'Chris O\\u2019Dowd', u'Christopher Guest', u'Ed Begley Jr.', u'Film', u'Fred Willard', u'Jane Lynch', u'Mascots', u'movie', u'Netflix', u'Parker Posey', u'review', u'Spinal Tap']",
"[u'Basketball', u'Homepage', u'Sports', u'Trending', u'Beth Mounier', u'cassie macleod', u'Dalayna Sampton', u'Joleen Yang', u'Mikayla Williams', u'Taylor Tanita', u'UCSD', u\"Women's Basketball\"]",
"[u'Basketball', u'Homepage', u'Sports', u'Trending', u'Adam Klie', u'Azusa Pacific University', u'CCAA', u'Dixie State', u\"Men's Basketball\", u'Tritons', u'UCSD']",
"[u'Homepage', u'Soccer', u'Sports', u'Trending', u'Azusa Pacific', u'Jordyn McNutt', u\"Katie O'Laughlin\", u'Mary Reilly', u'NCAA Division-II', u'UCSD', u\"Women's Soccer\"]",
"[u'Arts & Entertainment', u'Music', u'Slider', u'AS', u'asce', u'Concerts', u'Council', u\"Founder's Day\", u'Hullabaloo', u'Isaiah Rashad', u'Rap', u'Responsible Action Protocol', u'sun god', u'UCSD']"
)), .Names = c("title", "tags"), row.names = 215:220, class = "data.frame")

there is a title column and a tags column. For data manipulation reasons, the tags column is not a list; each entry is a single string that looks like an array.
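For illustration only, one way such a string could be split back into individual tags is sketched below. The helper name parse_tags is hypothetical, and the sketch assumes every item is written as u'...' or u"..." and never contains its own quote character.

```r
# Hypothetical helper, not part of the original post: split one tags string
# into a character vector of individual tags.
parse_tags <- function(tag_string) {
  # Match u'...' or u"..." items (assumes an item never contains its own quote character)
  items <- regmatches(tag_string, gregexpr("u'[^']*'|u\"[^\"]*\"", tag_string))[[1]]
  # Drop the leading u plus opening quote, and the closing quote
  substring(items, 3, nchar(items) - 1)
}

example <- "[u'Basketball', u'Homepage', u'UCSD', u\"Women's Basketball\"]"
tags <- parse_tags(example)
print(tags)
```

With the tags in this form, exact matching such as intersect(sports, tags) becomes possible, which behaves differently from the substring matching that grepl performs.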

I have a separate character vector called sports, which holds the names of various sports.

sports <- c("Basketball", "Soccer", "Baseball")

I would like to create a new column in the original dataframe that would indicate which sport has been detected.
I started to use grepl and created the following function:

detectSports <- function(sport_item){
  # TRUE for every row whose tags contain the sport (case-insensitive)
  grepl(tolower(sport_item), tolower(original_df$tags))
}

and applied this function to the list of sports:

ss <- lapply(sports, detectSports)

The result is a list of logical vectors, one per sport.
I am having trouble matching this back to my original data frame. I believe I could use colnames, but I am not sure how that would work.

Appreciate any advice!


You can try the following. It adds a new column sports to original_df: a row that matches exactly one sport gets that sport's name, a row that matches several sports gets them separated by commas, and a row that matches no sport gets an empty string:

original_df$sports <- unlist(apply(t(do.call(rbind, lapply(sports, detectSports))), 1, 
                 function(x) ifelse (any(x), paste(sports[which(x)], collapse=','), '')))


# [1] ""           ""           "Basketball" "Basketball" "Soccer"     ""
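The colnames idea from the question can also be sketched directly. The example below is a minimal sketch on a small stand-in data frame (df and its three rows are made up for illustration, not taken from original_df); it relies on sapply naming the matrix columns after the sports, and on grepl's ignore.case argument instead of tolower.

```r
# Small stand-in for original_df so the sketch runs on its own (made-up rows)
df <- data.frame(
  title = c("Film Review: Mascots", "Men s Basketball Goes 2-0", "Women s Soccer Wins"),
  tags  = c("[u'Films & TV', u'movie']",
            "[u'Basketball', u'Sports']",
            "[u'Soccer', u'Sports', u'Basketball']"),
  stringsAsFactors = FALSE
)
sports <- c("Basketball", "Soccer", "Baseball")

# One logical column per sport; sapply() names the columns after the sports
hits <- sapply(sports, function(s) grepl(s, df$tags, ignore.case = TRUE))

# Collapse each row's matching sports into one comma-separated string
# (paste() on an empty selection yields "", so no ifelse is needed)
df$sports <- apply(hits, 1, function(x) paste(colnames(hits)[x], collapse = ","))

# Optionally keep the per-sport indicator columns as well
df <- cbind(df, hits)
print(df[, c("title", "sports", sports)])
```

Keeping the indicator columns makes it easy to filter by a single sport later, for example df[df$Soccer, ]. Note that grepl treats each sport as a regular expression and matches substrings, so a sport name containing special characters would need fixed = TRUE or escaping.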