David Leal David Leal -4 years ago 94
R Question

Group by id a dataset then create a new column subtracting different columns from current and subsequent row

I have found similar problem like this, but not exactly the same. Here is my problem, I have the following data set:

> ds
id begin end
1 1 2017-01-15 2017-01-17
2 1 2017-01-01 2017-01-03
3 2 2017-02-01 2017-02-28
4 4 2017-04-11 2017-05-11
5 3 2017-02-05 2017-02-10
6 4 2017-03-10 2017-03-20
7 1 2017-01-30 2017-02-03
8 3 2017-02-28 2017-03-09
9 4 2017-02-26 2017-03-05


I want to create the following column:
check
that verifies the following condition for each rows with the same
id
value:

ds[i,]$begin - ds[i-1,]$end < 30 => 1 # for each row i


otherwise is
0
. When it is the first element of the group, then there is no previous information, so in such case the value of the new column
check
will be always zero too.

The difference with other questions (solved via:
ave
,
dplyr
) I have seen, is that I need to make a calculation that involves subsequent row but also not the same column.

Here is the code:

Defining the data set

id <- c("1", "1", "2", "4", "3", "4", "1", "3")
begin <- c("20170115", "20170101", "20170201",
"20170411",
"20170205", "20170310",
"20170130", "20170228"
)

end <- c("20170117", "20170103", "20170228",
"20170511",
"20170210", "20170320",
"20170203", "20170309"
)

ds <- data.frame(id = id, begin = as.Date(begin, "%Y%m%d"), end = as.Date(end, "%Y%m%d"))


Sorting the information (we need it for the current solution using a for-loop)

idx = order(rank(ds$id), ds$begin, decreasing = FALSE)
ds <- ds[idx,]


Now using the for-loop for assigning the control variable:
check
:

ds$check <- numeric(nrow(ds))
ds$check <- NA_integer_

nrep <- -1
for (i in 1:nrow(ds)) {
rowi <- ds[i,]
if (nrep == -1) {# Setting the first element of ds
end.prev <- rowi$end
id.prev <- rowi$id
ds[i,]$check <- 0
nrep = 1
} else {
id.current <- rowi$id
if(id.prev == id.current) {
ds[i,]$check <- ifelse(rowi$begin - end.prev < 30, 1, 0)
} else {
ds[i,]$check <- 0
}
end.prev <- rowi$end
id.prev <- id.current
}
}


Finally the expected output:

> ds
id begin end check
2 1 2017-01-01 2017-01-03 0
1 1 2017-01-15 2017-01-17 1
7 1 2017-01-30 2017-02-03 1
3 2 2017-02-01 2017-02-28 0
5 3 2017-02-05 2017-02-10 0
8 3 2017-02-28 2017-03-09 1
6 4 2017-03-10 2017-03-20 0
4 4 2017-04-11 2017-05-11 1
>


Thanks for any hint.

Answer Source

With dplyr, using lag makes this easy:

ds %>% 
  group_by(id) %>% 
  arrange(id, begin) %>% 
  mutate(check = c(0, as.numeric(begin - lag(end) < 30)[-1]))

Gives:

Source: local data frame [8 x 4]
Groups: id [4]

      id      begin        end check
  <fctr>     <date>     <date> <dbl>
1      1 2017-01-01 2017-01-03     0
2      1 2017-01-15 2017-01-17     1
3      1 2017-01-30 2017-02-03     1
4      2 2017-02-01 2017-02-28     0
5      3 2017-02-05 2017-02-10     0
6      3 2017-02-28 2017-03-09     1
7      4 2017-03-10 2017-03-20     0
8      4 2017-04-11 2017-05-11     1

Explanation:

  • First we group_by, so calculations are done by id.
  • We arrange to make sure we start with the earliest date.
  • For all checks, the first value is 0,
  • and the other values are simply whether the current begin minus the previous end is smaller than 30. We use as.numeric to convert from logical to numeric.

Also see: vignette('window-functions').

Note: I think the first value per group can simply be left at NA, since it is indeed unknown or undefined, and this will simplify it to:

ds %>% 
  group_by(id) %>%
  mutate(check = as.numeric(begin - lag(end, order_by = begin) < 30))
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download