BLT - 1 year ago 65
R Question

# Efficient way to find manager's manager's id

I have a database of employees, with their manager's id, in long format (one row per employee per month). I would like to add a column that contains their manager's manager's id (or the id of their skip level manager).

Here is a toy dataset:

``````id <- c(seq.int(1,11), seq.int(2,12))
mgr_id <- as.integer(c(NA, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5,   #period 1
NA, 2, 5, 2, 5, 5, 3, 3, 5, 10, 10)) #period 2
period <- c(rep(1, 11), rep(2, 11))
left_company <- c(1, rep(0, 21))
joined_company <- c(rep(0, 21), 1)

df <- data.frame(id, mgr_id, period, left_company, joined_company)
``````

And here is a function I wrote that returns the expected results.

``````# finds the employee's manager in the correct period, and returns that manager's id

get_mgr_mgr_id <- function(manager_id, period){
mgr_mgr_id <- df\$mgr_id[df\$id == manager_id & df\$period == period]
return(mgr_mgr_id[1])
}
``````

When I use the function with
`mapply`
, all is well. Note that employee 1 left the company, and they were replaced by employee 5, who was replaced by employee 10, who was replaced by employee 12, a new hire.

``````df\$mgr_mgr_id <- mapply(get_mgr_mgr_id, df\$mgr_id, df\$period)

df
id mgr_id period left joined mgr_mgr_id
1   1     NA      1    1      0         NA
2   2      1      1    0      0         NA
3   3      1      1    0      0         NA
4   4      2      1    0      0          1
5   5      2      1    0      0          1
6   6      2      1    0      0          1
7   7      2      1    0      0          1
8   8      3      1    0      0          1
9   9      3      1    0      0          1
10 10      5      1    0      0          2
11 11      5      1    0      0          2
12  2     NA      2    0      0         NA
13  3      2      2    0      0         NA
14  4      5      2    0      0          2
15  5      2      2    0      0         NA
16  6      5      2    0      0          2
17  7      5      2    0      0          2
18  8      3      2    0      0          2
19  9      3      2    0      0          2
20 10      5      2    0      0          2
21 11     10      2    0      0          5
22 12     10      2    0      1          5
``````

My question: is there is a more efficient way to get this result? Currently it takes quite a long time to run even on 10,000 rows, and my dataset has closer to a million.

I'm also open to suggestions on a more general question title (possibly a version of this SQL question: Most efficient way to find something recursively in a table?)

You can run a join with data.table. I'm not sure how much faster it will be:

``````library(data.table)
setDT(df)

df[, m2id := df[.(id = mgr_id, period = period), on=c("id", "period"), mgr_id]]

id mgr_id period left_company joined_company m2id
1:  1     NA      1            1              0   NA
2:  2      1      1            0              0   NA
3:  3      1      1            0              0   NA
4:  4      2      1            0              0    1
5:  5      2      1            0              0    1
6:  6      2      1            0              0    1
7:  7      2      1            0              0    1
8:  8      3      1            0              0    1
9:  9      3      1            0              0    1
10: 10      5      1            0              0    2
11: 11      5      1            0              0    2
12:  2     NA      2            0              0   NA
13:  3      2      2            0              0   NA
14:  4      5      2            0              0    2
15:  5      2      2            0              0   NA
16:  6      5      2            0              0    2
17:  7      5      2            0              0    2
18:  8      3      2            0              0    2
19:  9      3      2            0              0    2
20: 10      5      2            0              0    2
21: 11     10      2            0              0    5
22: 12     10      2            0              1    5
id mgr_id period left_company joined_company m2id
``````

How it works

The syntax for a join is `x[i, on=, j]`. It uses `i` and `on` to subset `x` and then returns `j`. The key point here is setting `id = mgr_id` in `i` so we're subsetting to the manager's rows.

The syntax for assigning a column is `DT[, col_name := value]`. In this case, the value comes from the join explained in the last paragraph.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download