gnotnek - 1 year ago 82
R Question

# more efficient way to recode groups?

My goal is to recode group_old to look like group_desired:

``````group_old <- c(58,58,57,57,57,56,56,56,59,59,56)
group_desired <- c(1,1,2,2,2,3,3,3,4,4,3)
df <- data.frame(group_old, group_desired)

> df
group_old group_desired
1         58             1
2         58             1
3         57             2
4         57             2
5         57             2
6         56             3
7         56             3
8         56             3
9         59             4
10        59             4
11        56             3
``````

I was able to do it:

``````codex <- data.frame(old = unique(df\$group_old), new = 1:length(unique(df\$group_old)))

df\$group_new <- sapply(df\$group_old, FUN = function(x) codex\$new[codex\$old == x] )

> df
group_old group_desired group_new
1         58             1         1
2         58             1         1
3         57             2         2
4         57             2         2
5         57             2         2
6         56             3         3
7         56             3         3
8         56             3         3
9         59             4         4
10        59             4         4
11        56             3         3
``````

However, this code runs very slowly on a dataset with 8 millions obs and 400k groups. Is there a more efficient way to accomplish the same thing for large data?

Using `data.table`:

We group by `group_old`, and then create a new column by reference. `.GRP` is a special symbol in `data.table`. Its a simple grouping counter. It assigns 1 to the first group, 2 to the second.. and so on

``````group_old <- c(58,58,57,57,57,56,56,56,59,59,56)
df <- data.frame(group_old = group_old)

library(data.table)
setDT(df)[,group_desired := .GRP, by = group_old]

#   group_old group_desired
#1:        58             1
#2:        58             1
#3:        57             2
#4:        57             2
#5:        57             2
#6:        56             3
#7:        56             3
#8:        56             3
#9:        59             4
#10:        59             4
#11:        56             3
``````

Or using `dplyr`:

``````df\$group_desired <- group_indices(df, group_old)
``````

To get a similar result as above, we first define the factor levels for `group_old`:

``````df\$group_old <- factor(df\$group_old, levels = unique(df\$group_old))
df\$group_desired <- group_indices(df, group_old)
``````

Note: `group_indices` assigns group numbers based on ascending order (in case of numbers) or factor level (if the variable used is factor).

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download