JHegg JHegg - 3 years ago 121
R Question

Creating unique groups in sequential data that repeats through time

This kind of thing has been asked before, but not quite in this way that I can find.

Thread about creating sequential IDs, with several additional links

It's not hard to create identifiers in a sequence, but my data includes a temporal element that has thrown me for a loop. The following data is an imaginary dataset just to illustrate the problem in something tractable:

dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L,
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve",
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4,
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood",
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")

neighborhood street sequence visit
1 Uptown Hennepin 1 1
2 Uptown Dupont 2 1
3 Uptown Dupont 3 1
4 Uptown Lyndale 4 1
5 Uptown Dupont 5 2
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 14thAve 2 1
13 Dinkytown 12thAve 3 2


All data are, for the sake of imagination, from three individuals walking east in three neighborhoods of Minneapolis. Each row represents a time when their location was recorded. The first column is the neighborhood they are walking through. The second column is the intersection where they were located at each time-point. The third column is the sequence that these data occurred.

I want to create the
visit
column that records sequential time-points at the same street, in the same neighborhood, as a single visit, and later return visits as the next visit. How do I create this sort of sequential identifier?




I was thinking this
ave()
with
FUN=seq_along
trick might work, but I can't find a way of combining the factors that gets me where I want to be.

Create a sequential number (counter) for rows within each group of a dataframe [duplicate]

Uwe Uwe
Answer Source

The difficulty here is that subsequent recordings to the same street in the same neighborhood should be counted as one visit. This requires to collapse these rows into one, count the visits to different neighborhoods & streets and finally expand this to the original number of rows.

Note that column visit containing the expected result is not overwritten but kept for comparison with the computed visit_new column.

library(data.table)
setDT(DF)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      DF, on = .(neighborhood, street, sequence), roll = TRUE, visit_new := x.visit_new][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         1
 3:       Uptown    Dupont        3     1         1
 4:       Uptown   Lyndale        4     1         1
 5:       Uptown    Dupont        5     2         2
 6:     Downtown  Hennepin        1     1         1
 7:     Downtown Marquette        2     1         1
 8:     Downtown Nicolette        3     1         1
 9:     Downtown  Hennepin        4     2         2
10:     Downtown  Hennepin        5     2         2
11:    Dinkytown   12thAve        1     1         1
12:    Dinkytown   14thAve        2     1         1
13:    Dinkytown   12thAve        3     2         2

Explanation step by step

DF is coerced to data.table. The rleid() function creates unique numbers for changes in neighborhood & street.

 setDT(DF)[, visit_new := rleid(neighborhood, street)][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         2
 3:       Uptown    Dupont        3     1         2
 4:       Uptown   Lyndale        4     1         3
 5:       Uptown    Dupont        5     2         4
 6:     Downtown  Hennepin        1     1         5
 7:     Downtown Marquette        2     1         6
 8:     Downtown Nicolette        3     1         7
 9:     Downtown  Hennepin        4     2         8
10:     Downtown  Hennepin        5     2         8
11:    Dinkytown   12thAve        1     1         9
12:    Dinkytown   14thAve        2     1        10
13:    Dinkytown   12thAve        3     2        11

Note that rows 2 & 3 are repeated as well as rows 9 & 10. The duplicates are removed in the next step which creates a new, temporary data.table object:

setDT(DF)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         2
 3:       Uptown   Lyndale        4     1         3
 4:       Uptown    Dupont        5     2         4
 5:     Downtown  Hennepin        1     1         5
 6:     Downtown Marquette        2     1         6
 7:     Downtown Nicolette        3     1         7
 8:     Downtown  Hennepin        4     2         8
 9:    Dinkytown   12thAve        1     1         9
10:    Dinkytown   14thAve        2     1        10
11:    Dinkytown   12thAve        3     2        11

Now, we can number the visits to distinct neighborhoods and streets using the rowid() function:

setDT(DF)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][]
    neighborhood    street sequence visit visit_new
 1:       Uptown  Hennepin        1     1         1
 2:       Uptown    Dupont        2     1         1
 3:       Uptown   Lyndale        4     1         1
 4:       Uptown    Dupont        5     2         2
 5:     Downtown  Hennepin        1     1         1
 6:     Downtown Marquette        2     1         1
 7:     Downtown Nicolette        3     1         1
 8:     Downtown  Hennepin        4     2         2
 9:    Dinkytown   12thAve        1     1         1
10:    Dinkytown   14thAve        2     1         1
11:    Dinkytown   12thAve        3     2         2

Finally, we need to expand the result to the original number of rows again. This is accomplished by a rolling join of the temporary data.table with the original DF (all rows included) and an update in the join:

setDT(DF)[, visit_new := rleid(neighborhood, street)][
  , unique(.SD, by = "visit_new")][
    , visit_new := rowid(neighborhood, street)][
      DF, on = .(neighborhood, street, sequence), roll = TRUE, visit_new := x.visit_new][]

Perhaps, it might be worth to note that visit_new is used and re-used to hold temporary data through various stages until the final update.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download