David Leal -4 years ago 94
R Question

# Group by id a dataset then create a new column subtracting different columns from current and subsequent row

I have found similar problem like this, but not exactly the same. Here is my problem, I have the following data set:

``````> ds
id      begin        end
1  1 2017-01-15 2017-01-17
2  1 2017-01-01 2017-01-03
3  2 2017-02-01 2017-02-28
4  4 2017-04-11 2017-05-11
5  3 2017-02-05 2017-02-10
6  4 2017-03-10 2017-03-20
7  1 2017-01-30 2017-02-03
8  3 2017-02-28 2017-03-09
9  4 2017-02-26 2017-03-05
``````

I want to create the following column:
`check`
that verifies the following condition for each rows with the same
`id`
value:

``````ds[i,]\$begin - ds[i-1,]\$end < 30 => 1 # for each row i
``````

otherwise is
`0`
. When it is the first element of the group, then there is no previous information, so in such case the value of the new column
`check`
will be always zero too.

The difference with other questions (solved via:
`ave`
,
`dplyr`
) I have seen, is that I need to make a calculation that involves subsequent row but also not the same column.

Here is the code:

Defining the data set

``````id <- c("1", "1", "2", "4", "3", "4", "1", "3")
begin <- c("20170115", "20170101", "20170201",
"20170411",
"20170205", "20170310",
"20170130", "20170228"
)

end <- c("20170117", "20170103", "20170228",
"20170511",
"20170210", "20170320",
"20170203", "20170309"
)

ds <- data.frame(id = id, begin = as.Date(begin, "%Y%m%d"), end = as.Date(end, "%Y%m%d"))
``````

Sorting the information (we need it for the current solution using a for-loop)

``````idx = order(rank(ds\$id), ds\$begin, decreasing = FALSE)
ds <- ds[idx,]
``````

Now using the for-loop for assigning the control variable:
`check`
:

``````ds\$check <- numeric(nrow(ds))
ds\$check <- NA_integer_

nrep <- -1
for (i in 1:nrow(ds)) {
rowi <- ds[i,]
if (nrep == -1) {# Setting the first element of ds
end.prev <- rowi\$end
id.prev <- rowi\$id
ds[i,]\$check <- 0
nrep = 1
} else {
id.current <- rowi\$id
if(id.prev == id.current) {
ds[i,]\$check <- ifelse(rowi\$begin - end.prev < 30, 1, 0)
} else {
ds[i,]\$check <- 0
}
end.prev <- rowi\$end
id.prev <- id.current
}
}
``````

Finally the expected output:

``````> ds
id      begin        end check
2  1 2017-01-01 2017-01-03     0
1  1 2017-01-15 2017-01-17     1
7  1 2017-01-30 2017-02-03     1
3  2 2017-02-01 2017-02-28     0
5  3 2017-02-05 2017-02-10     0
8  3 2017-02-28 2017-03-09     1
6  4 2017-03-10 2017-03-20     0
4  4 2017-04-11 2017-05-11     1
>
``````

Thanks for any hint.

With `dplyr`, using `lag` makes this easy:

``````ds %>%
group_by(id) %>%
arrange(id, begin) %>%
mutate(check = c(0, as.numeric(begin - lag(end) < 30)[-1]))
``````

Gives:

``````Source: local data frame [8 x 4]
Groups: id [4]

id      begin        end check
<fctr>     <date>     <date> <dbl>
1      1 2017-01-01 2017-01-03     0
2      1 2017-01-15 2017-01-17     1
3      1 2017-01-30 2017-02-03     1
4      2 2017-02-01 2017-02-28     0
5      3 2017-02-05 2017-02-10     0
6      3 2017-02-28 2017-03-09     1
7      4 2017-03-10 2017-03-20     0
8      4 2017-04-11 2017-05-11     1
``````

Explanation:

• First we `group_by`, so calculations are done by `id`.
• We `arrange` to make sure we start with the earliest date.
• For all `check`s, the first value is `0`,
• and the other values are simply whether the current `begin` minus the previous `end` is smaller than 30. We use `as.numeric` to convert from logical to numeric.

Also see: `vignette('window-functions')`.

Note: I think the first value per group can simply be left at `NA`, since it is indeed unknown or undefined, and this will simplify it to:

``````ds %>%
group_by(id) %>%
mutate(check = as.numeric(begin - lag(end, order_by = begin) < 30))
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download