Ffiirree Ffiirree - 1 month ago 8
R Question

Dynamically create multiple subsets based on unique column values

I have data with a timestamp column as shown here

v1 v2 v3 v4 v5
1 apple 2/20/2015 12:09:19 AM 100 98
2 pear 2/19/2015 12:09:16 AM 98 97
3 apple 2/19/2015 12:09:17 AM NA 80
4 apple 2/17/2015 12:09:11 AM 78 75
5 pear 2/20/2015 12:09:12 AM 50 62
6 cherry 2/21/2015 12:09:13 AM 75 75
7 apple 2/20/2015 12:09:14 AM 75 75


I want to determine if an entry occurred for each fruit type in each day. Both file-size and number of fruit types are large.

First for each fruit type I will want to dynamically return the subset e.g. for apple

v1 v2 v3 v4 v5
1 apple 2/20/2015 12:09:15 AM 100 98
3 apple 2/19/2015 12:09:15 AM NA 80
4 apple 2/17/2015 12:09:15 AM 78 75
7 apple 2/20/2015 12:09:14 AM 75 75


Then for each fruit type, I am looking to count if any entry occurred in a day (e.g. yes or no or 0 or 1 as below) e.g. for apple

v2 v3 sign
apple 2/17/2015 1
apple 2/18/2015 0
apple 2/19/2015 1
apple 2/20/2015 1
apple 2/20/2015 1


I am new to r and any guidance is helpful. I am currently using unique(df$v2) but getting stuck on hash or assign naming.

Answer

To return the subset

ap <- subset(df, v2 == "apple")

Then the below will, I think, get you what you want for apples. First, recode v3 to be a Date.

d$v3 <- as.Date(d$v3, format = "%m/%d/%y")

Then create a sequence of dates in the range you want, as a data frame, and merge it in, with sign initially set to 0 for all dates.

dates <- data.frame(v3 = seq.Date(
                     from = as.Date("2/17/15", format = "%m/%d/%y"), 
                     to = as.Date("2/21/15", format = "%m/%d/%y"),
                     by = "days"),
                sign = 0)

ap <- merge(ap, dates, all = TRUE, by = "v3")

Finally, recode sign to 1 when there's valid data

ap$sign <- ifelse(!is.na(ap$v4)|!is.na(ap$v5), 1, ap$sign)
ap
          v3    v2  v4 v5 sign
 1 2015-02-17 apple  78 75    1
 2 2015-02-18  <NA>  NA NA    0
 3 2015-02-19 apple  NA 80    1
 4 2015-02-20 apple 100 98    1
 5 2015-02-20 apple  75 75    1
 6 2015-02-21  <NA>  NA NA    0

You could generalize this approach with all fruits by splitting the data frame first, and then essentially looping over the list to do all the same steps.

splt <- split(d, d$v2)
splt <- lapply(seq_along(splt), function(i) merge(splt[[i]], dates, by = "v3", all = TRUE))
lapply(splt, function(x) {
    x$sign <- ifelse(!is.na(x$v4)|!is.na(x$v5), 1, x$sign)
x
})

[[1]]
          v3    v2  v4 v5 sign
1 2015-02-17 apple  78 75    1
2 2015-02-18  <NA>  NA NA    0
3 2015-02-19 apple  NA 80    1
4 2015-02-20 apple 100 98    1
5 2015-02-20 apple  75 75    1
6 2015-02-21  <NA>  NA NA    0

[[2]]
          v3     v2 v4 v5 sign
1 2015-02-17   <NA> NA NA    0
2 2015-02-18   <NA> NA NA    0
3 2015-02-19   <NA> NA NA    0
4 2015-02-20   <NA> NA NA    0
5 2015-02-21 cherry 75 75    1

[[3]]
          v3   v2 v4 v5 sign
1 2015-02-17 <NA> NA NA    0
2 2015-02-18 <NA> NA NA    0
3 2015-02-19 pear 98 97    1
4 2015-02-20 pear 50 62    1
5 2015-02-21 <NA> NA NA    0

Edit

I should also have mentioned, a much easier way if all you want is the number of entries per day by fruit, is with dplyr, as follows:

d %>% 
    group_by(v2, v3) %>% 
    summarize(n = n())

      v2         v3     n
   <chr>     <date> <int>
1  apple 2015-02-17     1
2  apple 2015-02-19     1
3  apple 2015-02-20     2
4 cherry 2015-02-21     1
5   pear 2015-02-19     1
6   pear 2015-02-20     1

But that doesn't seem to be quite what you were looking for, which is why I took the approach I did.

Comments