AB6 AB6 - 1 month ago 15
R Question

Split the dataset in R

I have a dataset which contains Billno and Product columns in the following format:

Billno Product
1 123
1 176
2 189
3 1
3 2
3 44
3 46


etc

The output should be a table of the form:

Billno Prod1 Prod2 Prod3 Prod4
1 123 176
2 189
3 1 2 44 46


Split function works but the dataset contains more than million records. Is there an efficient way of doing this?

Answer

with dplyr:

library(dplyr)
library(tidyr)

bill <- rep(c(1,1,2,3,3,3,3),5)
prod <- rep(c(123,176,189, 1,2,44,46),5)

df <- data.frame(bill=bill, prod=prod)
#determine max product count (number of columns in result)
prodmax <- df %>% group_by(bill) %>% summarise(n = n())

df %>% group_by(bill) %>% 
  mutate(prodn = paste0("prod",row_number())) %>% 
  spread(prodn, prod) %>% 
#select columns in correct order
  select_(.dots = c('bill',paste0('prod',seq(1,max(prodmax$n)))))

results in:

      bill  prod1 prod2 prod3 prod4
(dbl) (dbl) (dbl) (dbl) (dbl)
1     1       123   176    NA    NA
2     2       189    NA    NA    NA
3     3         1     2    44    46