Andy.Jian - 1 month ago 4x
R Question

# How can I efficiently subset for big data.frame objects in the list?

I have data.frame object in the list, I am gonna filter out based on last column (A.K.A, score) of each respectively. subsetting for list is intuitive for me, but I want to have two different set (i.e., pass/fail) as result of filtration for each data.frame object. I think the way I used is not elegant, and seeking better/efficient solution for it. Can anyone point me how to achieve more elegant solution for this sort of task? Thanks a lot !

toy data:

``````mylist <- list(df1=data.frame( from=seq(1, by=4, len=16), to=seq(3, by=4, len=16), score=sample(30, 16)),
df2=data.frame( from=seq(3, by=7, len=20), to=seq(6, by=7, len=20), score=sample(30, 20)),
df3=data.frame( from=seq(4, by=8, len=25), to=seq(7, by=8, len=25), score=sample(30, 25)))
``````

my initial attempt:

``````pass <- lapply(mylist, function(ele_) {
ans <- subset(ele_, ele_\$score > 20)
ans
})
``````

turns out I want also to have its opposite set where instances didn't meet the filtering condition, and put pass, fail set for each data.frame object into one list.

desired output: each data.frame object can have pass, fail set after subsetting.

what's better for doing this if I have very big data.frame object in the list? Does anyone have knows useful trick for this?

May be the `data.table` option would be more efficient

``````library(data.table)
lapply(mylist, function(x) setDT(x)[score > 20])
``````

Or use `filter` from `dplyr` with `map` from `purrr`

``````library(dplyr)
library(purrr)
mylist %>%
map(filter, score > 20)
``````

In addition to the above methods of subsetting each `list` element, we can also `rbind` the datasets into a single dataset (`rbindlist` from `data.table` or `bind_rows` from `dplyr` with an identifier column) and subset by group.

``````rbindlist(mylist, idcol= 'grp')[score > 20, .SD , by = .(grp)]
``````

Or with `dplyr`

``````mylist %>%
bind_rows(., .id = 'grp') %>%
group_by(grp) %>%
filter(score > 20)
``````

If the intention is to separate the `data.frame` into a `list` of 2 (> 20 and < 20 for 'score')

``````lapply(mylist, function(x) split(x, c("FAIL", "PASS")[(x\$score > 20)+1]))
``````