BLT BLT - 3 months ago 15
R Question

Efficient way to find manager's manager's id

I have a database of employees, with their manager's id, in long format (one row per employee per month). I would like to add a column that contains their manager's manager's id (or the id of their skip level manager).

Here is a toy dataset:

id <- c(seq.int(1,11), seq.int(2,12))
mgr_id <- as.integer(c(NA, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, #period 1
NA, 2, 5, 2, 5, 5, 3, 3, 5, 10, 10)) #period 2
period <- c(rep(1, 11), rep(2, 11))
left_company <- c(1, rep(0, 21))
joined_company <- c(rep(0, 21), 1)

df <- data.frame(id, mgr_id, period, left_company, joined_company)


And here is a function I wrote that returns the expected results.

# finds the employee's manager in the correct period, and returns that manager's id

get_mgr_mgr_id <- function(manager_id, period){
mgr_mgr_id <- df$mgr_id[df$id == manager_id & df$period == period]
return(mgr_mgr_id[1])
}


When I use the function with
mapply
, all is well. Note that employee 1 left the company, and they were replaced by employee 5, who was replaced by employee 10, who was replaced by employee 12, a new hire.

df$mgr_mgr_id <- mapply(get_mgr_mgr_id, df$mgr_id, df$period)

df
id mgr_id period left joined mgr_mgr_id
1 1 NA 1 1 0 NA
2 2 1 1 0 0 NA
3 3 1 1 0 0 NA
4 4 2 1 0 0 1
5 5 2 1 0 0 1
6 6 2 1 0 0 1
7 7 2 1 0 0 1
8 8 3 1 0 0 1
9 9 3 1 0 0 1
10 10 5 1 0 0 2
11 11 5 1 0 0 2
12 2 NA 2 0 0 NA
13 3 2 2 0 0 NA
14 4 5 2 0 0 2
15 5 2 2 0 0 NA
16 6 5 2 0 0 2
17 7 5 2 0 0 2
18 8 3 2 0 0 2
19 9 3 2 0 0 2
20 10 5 2 0 0 2
21 11 10 2 0 0 5
22 12 10 2 0 1 5


My question: is there is a more efficient way to get this result? Currently it takes quite a long time to run even on 10,000 rows, and my dataset has closer to a million.

I'm also open to suggestions on a more general question title (possibly a version of this SQL question: Most efficient way to find something recursively in a table?)

Answer

You can run a join with data.table. I'm not sure how much faster it will be:

library(data.table)
setDT(df)

df[, m2id := df[.(id = mgr_id, period = period), on=c("id", "period"), mgr_id]]

    id mgr_id period left_company joined_company m2id
 1:  1     NA      1            1              0   NA
 2:  2      1      1            0              0   NA
 3:  3      1      1            0              0   NA
 4:  4      2      1            0              0    1
 5:  5      2      1            0              0    1
 6:  6      2      1            0              0    1
 7:  7      2      1            0              0    1
 8:  8      3      1            0              0    1
 9:  9      3      1            0              0    1
10: 10      5      1            0              0    2
11: 11      5      1            0              0    2
12:  2     NA      2            0              0   NA
13:  3      2      2            0              0   NA
14:  4      5      2            0              0    2
15:  5      2      2            0              0   NA
16:  6      5      2            0              0    2
17:  7      5      2            0              0    2
18:  8      3      2            0              0    2
19:  9      3      2            0              0    2
20: 10      5      2            0              0    2
21: 11     10      2            0              0    5
22: 12     10      2            0              1    5
    id mgr_id period left_company joined_company m2id

How it works

The syntax for a join is x[i, on=, j]. It uses i and on to subset x and then returns j. The key point here is setting id = mgr_id in i so we're subsetting to the manager's rows.

The syntax for assigning a column is DT[, col_name := value]. In this case, the value comes from the join explained in the last paragraph.