danpelota - 5 months ago
R Question

Assigning group ID with ddply

Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame by unique combinations of fields. Here's my current approach:

> library(plyr)
> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"),
st.num=c("101", "102", "105", "102", "150"),
st.name=c("Main", "Elm", "Park", "Elm", "Main"))
> df
name st.num st.name
1 Anne 101 Main
2 Bob 102 Elm
3 Chris 105 Park
4 Dan 102 Elm
5 Erin 150 Main
> # A function to generate a random string
> getString <- function(size=10) return(paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse=''))
> # Assign a random string for each unique street number + street name combination
> df <- ddply(df,
c("st.num", "st.name"),
function(x) transform(x, household=getString()))
> df
name st.num st.name household
1 Anne 101 Main 1EZWm4BQel
2 Bob 102 Elm xNaeuo50NS
3 Dan 102 Elm xNaeuo50NS
4 Chris 105 Park Ju1NZfWlva
5 Erin 150 Main G2gKAMZ1cU

While this works well for data frames with relatively few rows or a small number of groups, I run into performance problems with larger data sets (more than 100,000 rows) that have many unique groups.

Any suggestions to improve the speed of this task? Possibly with plyr's experimental idata.frame()? Or am I going about this all wrong?

Thanks in advance for your help.


Try using the id function (also in plyr):

df$id <- id(df[c("st.num", "st.name")], drop = TRUE)
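If an integer group ID is enough and you would rather not depend on plyr, something along these lines should also work in base R. This is only a sketch: the `key` variable and the `"|"` separator are my own choices, and it assumes the separator never appears in the street number or street name.

```r
# Same example data as in the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Build one key per row from the grouping fields.
# Assumption: "|" does not occur in st.num or st.name.
key <- paste(df$st.num, df$st.name, sep = "|")

# Number the keys in order of first appearance; rows that share
# a key get the same integer ID.
df$household <- match(key, unique(key))
```

Both `paste()` and `match()` are vectorised, so no function is called once per group, which is where the `ddply()` version spends most of its time.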


Note that dplyr's own id() function has been deprecated since dplyr version 0.5.0. In dplyr, the function group_indices() provides the same functionality.
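For reference, the group_indices() route might look roughly like this. It is a sketch that assumes dplyr 0.5.0 or later is installed; the `household` column name is taken from the question.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Group by the two address fields, then ask for one integer ID per row.
# Rows in the same group receive the same ID.
df$household <- df %>%
  group_by(st.num, st.name) %>%
  group_indices()
```

In dplyr 1.0.0 and later, cur_group_id() inside mutate() on a grouped data frame is another way to get the same kind of ID.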