AB6 - 10 months ago 54

R Question

I have a dataset which contains Billno and Product columns in the following format:

`Billno Product`

1 123

1 176

2 189

3 1

3 2

3 44

3 46

etc

The output should be a table of the form:

`Billno Prod1 Prod2 Prod3 Prod4`

1 123 176

2 189

3 1 2 44 46

Split function works but the dataset contains more than million records. Is there an efficient way of doing this?

Answer Source

with dplyr:

```
library(dplyr)
library(tidyr)
bill <- rep(c(1,1,2,3,3,3,3),5)
prod <- rep(c(123,176,189, 1,2,44,46),5)
df <- data.frame(bill=bill, prod=prod)
#determine max product count (number of columns in result)
prodmax <- df %>% group_by(bill) %>% summarise(n = n())
df %>% group_by(bill) %>%
mutate(prodn = paste0("prod",row_number())) %>%
spread(prodn, prod) %>%
#select columns in correct order
select_(.dots = c('bill',paste0('prod',seq(1,max(prodmax$n)))))
```

results in:

```
bill prod1 prod2 prod3 prod4
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 123 176 NA NA
2 2 189 NA NA NA
3 3 1 2 44 46
```