Jordan Michnoff Jordan Michnoff - 1 month ago 17
R Question

How to subset data in a loop in R?

I'm trying to write a loop in R that has the same effect on a dataset as the following code:

AD1 <- subset(data, AD1==1)
AD2 <- subset(data, AD2==1)
AD3 <- subset(data, AD3==1)
AD4 <- subset(data, AD4==1)
AD5 <- subset(data, AD5==1)


After a few attempts, there is where I'm stuck:

for (i in 1:5) {
adNum <- paste("AD", i, sep="")
aSubset <- paste(adNum, " <- subset(data, ", adNum, "==1)", sep="")
aSubset
}


Ideally, I'd like to be able to apply the same principle to loop this code for all 5 subsets:

uAD1 <- mean(AD1$u)
vAD1 <- mean(AD1$v)
wAD1 <- mean(AD1$w)
xAD1 <- mean(AD1$x)
yAD1 <- mean(AD1$y)
zAD1 <- mean(AD1$z)

Answer

It sounds like you're in "split-apply-combine" territory: you want to cut up a large data set into subsets, and apply the same operations to those subsets (or different columns of those subsets). While using looped code and parsed evaluation isn't wrong per se, this method tends to be fragile (breaks if something subtle changes), error-prone, and hard to read.

R has functions such as aggregate and lapply/vapply that are useful for these purposes. There are now also several mature R packages that help a person do exactly these types of operations (check out the data.table and tidyverse packages).

Consider the "diamonds" data set (which comes with the tidyverse packages). The first few rows look like this:

  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48 

Suppose we wanted the average depth, table, and price for each type of cut. In base R, you could use aggregate:

data.agg <- aggregate(diamonds[c('depth', 'table', 'price')], by = diamonds['cut'], FUN = mean)

        cut    depth    table    price
1      Fair 64.04168 59.05379 4358.758
2      Good 62.36588 58.69464 3928.864
3 Very Good 61.81828 57.95615 3981.760
4   Premium 61.26467 58.74610 4584.258
5     Ideal 61.70940 55.95167 3457.542

Or using tidyverse's dplyr package for data manipulation:

library(dplyr)

data.dplyr <- diamonds %>% 
    group_by(cut) %>% 
    select(depth, table, price) %>% 
    summarize_all(mean)

        cut    depth    table    price
      <ord>    <dbl>    <dbl>    <dbl>
1      Fair 64.04168 59.05379 4358.758
2      Good 62.36588 58.69464 3928.864
3 Very Good 61.81828 57.95615 3981.760
4   Premium 61.26467 58.74610 4584.258
5     Ideal 61.70940 55.95167 3457.542

While dplyr is more verbose than the basic aggregate function (more typing), it's more flexible and allows for more advanced data manipulation and summarization operations. Note that both versions are much more succinct than the looping alternative, and easy to read and follow.

Comments