watchtower watchtower - 1 month ago 7
R Question

Select grouped rows with at least one matching criterion

I want to select all those groupings that contain at least one of the elements that I am interested in. I was able to do this by creating an intermediate array, but I am looking for something simpler and faster. This is because my actual data set has over 1M rows (and 20 columns) so I am not sure whether I will have sufficient memory to create an intermediate array. More importantly, the below method on my original file takes a lot of time.

Here's my code and data:

a) Data

dput(Data_File)
structure(list(Group_ID = c(123, 123, 123, 123, 234, 345, 444,
444), Product_Name = c("ABCD", "EFGH", "XYZ1", "Z123", "ABCD",
"EFGH", "ABCD", "ABCD"), Qty = c(2, 3, 4, 5, 6, 7, 8, 9)), .Names = c("Group_ID",
"Product_Name", "Qty"), row.names = c(NA, 8L), class = "data.frame")


b) Code: I want to select
Group_ID
that has at least one
Product_Name = ABCD


#Find out transactions
Data_T <- Data_File %>%
group_by(Group_ID) %>%
dplyr::filter(Product_Name == "ABCD") %>%
select(Group_ID) %>%
distinct()

#Now filter them
Filtered_T <- Data_File %>%
group_by(Group_ID) %>%
dplyr::filter(Group_ID %in% Data_T$Group_ID)


c) Expected output is

Group_ID Product_Name Qty
<dbl> <chr> <dbl>
123 ABCD 2
123 EFGH 3
123 XYZ1 4
123 Z123 5
234 ABCD 6
444 ABCD 8
444 ABCD 9


I'm struggling with this for over 3 hours now. I looked at the auto-suggested thread by SO: Select rows with at least two conditions from all conditions but my question is very different.

Answer

I would do it like this:

Data_File %>% group_by(Group_ID) %>%
    filter(any(Product_Name %in% "ABCD"))
# Source: local data frame [7 x 3]
# Groups: Group_ID [3]
# 
#   Group_ID Product_Name   Qty
#      <dbl>        <chr> <dbl>
# 1      123         ABCD     2
# 2      123         EFGH     3
# 3      123         XYZ1     4
# 4      123         Z123     5
# 5      234         ABCD     6
# 6      444         ABCD     8
# 7      444         ABCD     9

Explanation: any() will return TRUE if there are any rows (within the group) that match the condition. The length-1 result will then be recycled to the full length of the group and the entire group will be kept. You could also do it with sum(Product_name %in% "ABCD") > 0 as the condition, but the any reads very nicely. Use sum instead if you wanted a more complicated condition, like 3 or more matching product names.

I prefer%in%to == for things like this because it has better behavior with NA and it is easy to expand if you wanted to check for any of multiple products by group.


If speed and efficiency are an issue, data.table will be faster. I would do it like this, which relies on a keyed join for the filtering and uses no non-data.table operations, so it should be very fast:

library(data.table)
df = as.data.table(df)
setkey(df)
groups = unique(subset(df, Product_Name %in% "ABCD", Group_ID))
df[groups, nomatch = 0]
#    Group_ID Product_Name Qty
# 1:      123         ABCD   2
# 2:      123         EFGH   3
# 3:      123         XYZ1   4
# 4:      123         Z123   5
# 5:      234         ABCD   6
# 6:      444         ABCD   8
# 7:      444         ABCD   9