 Mr.ecos -4 years ago 251
R Question

# Assign unique ID to distinct values within Group with dplyr

Problem: I need to make a unique ID field for data that has two levels of grouping. In the example code here, it is

`Emp`
and
`Color`
. The ID needs to be structured as:

`Emp`
+ unique number of each
`Color`
+ sequential number for duplicated
`Colors`
.

These values are separated by periods.

Example data:

``````dat <- data.frame(Emp = c("A","A","A","B","B","C"),
Color = c("Red","Green","Green","Orange","Yellow","Brown"),
stringsAsFactors = FALSE)
``````

The ID is supposed to appear as this:

``````ID <- c("A.01.001", "A.02.001", "A.02.002", "B.01.001", "B.02.001", "C.01.001")
``````

ID
 "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"

The three character suffix to the ID to record the duplicates can be done as:

`````` group_by(dat, Emp, Color) %>%
``````

But I am unable to assign sequential numbers to the unique occurrence of
`Color`
with each
`Emp`
group.

I prefer a dplyr solution, but any method would be appreciated. Jaap

Using `data.table` and `sprintf`:

``````library(data.table)
setDT(dat)[, ID := sprintf('%s.%02d.%03d',
Emp, rleid(Color), rowid(rleid(Color))),
by = Emp]
``````

you get:

``````> dat
Emp  Color       ID
1:   A    Red A.01.001
2:   A  Green A.02.001
3:   A  Green A.02.002
4:   B Orange B.01.001
5:   B Yellow B.02.001
6:   C  Brown C.01.001
``````

How this works:

• You convert `dat` to a `data.table` with `setDT()`
• Group by `Emp`.
• And create the `ID`-variable with the `sprintf`-function. With `sprintf` you paste several vector easily together according to a specified format.
• The use of `:=` means that the `data.table` is updated by reference.
• `%s` indicates that a string is to be used in the first part (which is `Emp`). `%02d` & `%03d` indicates that a number needs to have two or three digits with a leading zero(s) when needed. The dots in between will taken literally and thus in cluded in the resulting string.

Adressing the comment of @jsta, if the values in the `Color`-column are not sequential you can use:

``````setDT(dat)[, r := as.integer(factor(Color, levels = unique(Color))), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp][, r:= NULL]
``````

This will also maintain the order in which the `Color` column is presented. Instead of `as.integer(factor(Color, levels = unique(Color)))` you can also use `match(Color, unique(Color))` as shown by akrun.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download