gnotnek gnotnek - 4 months ago 19
R Question

more efficient way to recode groups?

My goal is to recode group_old to look like group_desired:

group_old <- c(58,58,57,57,57,56,56,56,59,59,56)
group_desired <- c(1,1,2,2,2,3,3,3,4,4,3)
df <- data.frame(group_old, group_desired)

> df
group_old group_desired
1 58 1
2 58 1
3 57 2
4 57 2
5 57 2
6 56 3
7 56 3
8 56 3
9 59 4
10 59 4
11 56 3


I was able to do it:

codex <- data.frame(old = unique(df$group_old), new = 1:length(unique(df$group_old)))

df$group_new <- sapply(df$group_old, FUN = function(x) codex$new[codex$old == x] )

> df
group_old group_desired group_new
1 58 1 1
2 58 1 1
3 57 2 2
4 57 2 2
5 57 2 2
6 56 3 3
7 56 3 3
8 56 3 3
9 59 4 4
10 59 4 4
11 56 3 3


However, this code runs very slowly on a dataset with 8 millions obs and 400k groups. Is there a more efficient way to accomplish the same thing for large data?

Answer

Using data.table:

We group by group_old, and then create a new column by reference. .GRP is a special symbol in data.table. Its a simple grouping counter. It assigns 1 to the first group, 2 to the second.. and so on

group_old <- c(58,58,57,57,57,56,56,56,59,59,56)
df <- data.frame(group_old = group_old)

library(data.table)
setDT(df)[,group_desired := .GRP, by = group_old]

 #   group_old group_desired
 #1:        58             1
 #2:        58             1
 #3:        57             2
 #4:        57             2
 #5:        57             2
 #6:        56             3
 #7:        56             3
 #8:        56             3
 #9:        59             4
#10:        59             4
#11:        56             3

Or using dplyr:

df$group_desired <- group_indices(df, group_old)

To get a similar result as above, we first define the factor levels for group_old:

df$group_old <- factor(df$group_old, levels = unique(df$group_old))
df$group_desired <- group_indices(df, group_old)

Note: group_indices assigns group numbers based on ascending order (in case of numbers) or factor level (if the variable used is factor).