Sharon Sharon - 6 months ago 23
R Question

How to efficiently split 1 data frame column to create a new data frame assigning same value to each part of category column

I have rather un-tidy data by category where sometimes there are multiple categories in a single field. I'd like to assign the same value to each category in a category entry after splitting the multi-category entry. For example, if the data is

cat <- c("A,B,C", "B", "B,C", "A,E")
val <- c(300, 350, 400, 450)
mydf <- data.frame(cat, val, stringsAsFactors = FALSE)

cat val
A,B,C 300
B 350
B,C 400
A,E 450

From the first row, I need 300 to be assigned to each category A, B & C, then from the second row 350 assigned to B, 400 assigned to each B and C from the third row and then 450 assigned to A and E in the fourth row.

I came up with a very kludge-y for loop to accomplish this, but I know this is not memory-efficient because it is using rbind on an existing data frame.

resultsdf <- data.frame(temp_cats = character(0),
temp_vals = numeric(0), stringsAsFactors = FALSE)

for(i in 1:nrow(mydf)){
temp_cats <- stringr::str_split(mydf$cat[i], ",")[[1]]
temp_vals <- rep(val[i], length(temp_cats))
temp_df <- data.frame(temp_cats, temp_vals, stringsAsFactors = FALSE)
resultsdf <- rbind(resultsdf, temp_df)

Just curious if anyone has a more elegant R-syntax way of accomplishing this to end up with

temp_cats temp_vals
1 A 300
2 B 300
3 C 300
4 B 350
5 B 400
6 C 400
7 A 450
8 E 450


1) stack/unstack Use strsplit to split the strings and then unstack and stack it to get the long form s required. Finally fix up the names and make temp_vals numeric since unstack coerced them to character. No packages are used.

s <-  stack(unstack(transform(mydf, cat = strsplit(cat, ","))))
with(s, data.frame(temp_cats = values, temp_vals = as.numeric(as.character(ind))))

2) dplyr/tidyr Another approach is to use unnest in the tidyr package:


mydf %>%
     mutate(cat = strsplit(cat, ",")) %>%
     unnest() %>%
     transmute(temp_cat = cat, temp_vals = val)