Andy.Jian Andy.Jian - 2 months ago 5
R Question

How can I efficiently subset for big data.frame objects in the list?

I have data.frame object in the list, I am gonna filter out based on last column (A.K.A, score) of each respectively. subsetting for list is intuitive for me, but I want to have two different set (i.e., pass/fail) as result of filtration for each data.frame object. I think the way I used is not elegant, and seeking better/efficient solution for it. Can anyone point me how to achieve more elegant solution for this sort of task? Thanks a lot !

toy data:

mylist <- list(df1=data.frame( from=seq(1, by=4, len=16), to=seq(3, by=4, len=16), score=sample(30, 16)),
df2=data.frame( from=seq(3, by=7, len=20), to=seq(6, by=7, len=20), score=sample(30, 20)),
df3=data.frame( from=seq(4, by=8, len=25), to=seq(7, by=8, len=25), score=sample(30, 25)))

my initial attempt:

pass <- lapply(mylist, function(ele_) {
ans <- subset(ele_, ele_$score > 20)

turns out I want also to have its opposite set where instances didn't meet the filtering condition, and put pass, fail set for each data.frame object into one list.

desired output: each data.frame object can have pass, fail set after subsetting.

what's better for doing this if I have very big data.frame object in the list? Does anyone have knows useful trick for this?


May be the data.table option would be more efficient

lapply(mylist, function(x) setDT(x)[score > 20])

Or use filter from dplyr with map from purrr

mylist %>% 
      map(filter, score > 20)

In addition to the above methods of subsetting each list element, we can also rbind the datasets into a single dataset (rbindlist from data.table or bind_rows from dplyr with an identifier column) and subset by group.

rbindlist(mylist, idcol= 'grp')[score > 20, .SD , by = .(grp)]

Or with dplyr

mylist %>% 
    bind_rows(., .id = 'grp') %>%
    group_by(grp) %>%
    filter(score > 20)

If the intention is to separate the data.frame into a list of 2 (> 20 and < 20 for 'score')

lapply(mylist, function(x) split(x, c("FAIL", "PASS")[(x$score > 20)+1]))