Sescopeland - 1 month ago
R Question

How do I filter "CUSTOM" out of a string but not "CUSTOMER" in R? Grepl?

I'm trying to filter out custom items from a data frame in R using item descriptions. I want to get rid of all items with "CUSTOM" in the description, but I need to keep items with "CUSTOMER" in the description. I tried using a grepl function but to no avail. I've got 800,000+ rows of data, so something speedy would be helpful. This is just one filter out of many, so I am using dplyr and pipe operators with my other filters.

Generic code:

> items <- c("A", "B", "C")
> desc <- c("CUSTOM STAMP", "CUSTOMER 4X6 IN STAMP", "4X6 GENERIC STAMP")
> df <- data.frame(Items = items, Item_Desc = desc)
> df
  Items             Item_Desc
1     A          CUSTOM STAMP
2     B CUSTOMER 4X6 IN STAMP
3     C     4X6 GENERIC STAMP


I've tried something like this:

library(dplyr)
df <- df %>%
  filter(!grepl("CUSTOM", Item_Desc, fixed = TRUE))


But obviously, the result is:

> df
  Items         Item_Desc
1     C 4X6 GENERIC STAMP


Whereas the desired result would be:

> df
  Items             Item_Desc
1     B CUSTOMER 4X6 IN STAMP
2     C     4X6 GENERIC STAMP


Thanks!

Answer

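You need a regular expression with word boundaries here: "\\bCUSTOM\\b". The \\b anchor matches only between a word character and a non-word character (or the start/end of the string), so CUSTOM followed directly by ER is not matched.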

To make it work, you need to remove fixed = TRUE, because that argument makes grepl treat the pattern as a literal string rather than a regular expression.

Use

df <- df %>%
  filter(!grepl("\\bCUSTOM\\b", Item_Desc))

Only the rows that do not match will remain in df, because the result of grepl is negated with the ! operator.
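To see the difference on the sample data from the question, here is a minimal base-R sketch (no dplyr needed) comparing the literal match with the word-boundary match. The variable names literal_hit, word_hit, perl_hit and kept are just illustrative. The perl = TRUE variant is included only as something that may be worth timing on 800,000+ rows; whether it is actually faster on your data is an assumption to check, not a given.

```r
# Sample data from the question
desc <- c("CUSTOM STAMP", "CUSTOMER 4X6 IN STAMP", "4X6 GENERIC STAMP")
df <- data.frame(Items = c("A", "B", "C"), Item_Desc = desc)

# fixed = TRUE does a literal substring search, so "CUSTOMER" is hit as well
literal_hit <- grepl("CUSTOM", df$Item_Desc, fixed = TRUE)

# Word-boundary regex: "CUSTOM" must stand alone as a whole word
word_hit <- grepl("\\bCUSTOM\\b", df$Item_Desc)

# Same pattern with the PCRE engine; may be worth benchmarking on large data
perl_hit <- grepl("\\bCUSTOM\\b", df$Item_Desc, perl = TRUE)

# Base-R equivalent of filter(!grepl(...)): keep the rows that do NOT match
kept <- df[!word_hit, ]

print(literal_hit)
print(word_hit)
print(kept)
```

Inside a dplyr pipeline the same logical vector goes straight into filter(), as in the answer above.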