Polar Bear Polar Bear - 3 months ago 69
R Question

Using spread with duplicate identifiers for rows

I have a long form dataframe that have multiple entries for same date and person.

jj <- data.frame(month=rep(1:3,4),
student=rep(c("Amy", "Bob"), each=6),
A=c(9, 7, 6, 8, 6, 9, 3, 2, 1, 5, 6, 5),
B=c(6, 7, 8, 5, 6, 7, 5, 4, 6, 3, 1, 5))


I want to convert it to wide form and make it like this:



month Amy.A Bob.A Amy.B Bob.B
1
2
3
1
2
3
1
2
3
1
2
3







My question is very similar to this. I have used the given code in the answer :

kk<-jj%>% gather(variable, value, -(month:student)) %>% unite(temp, student, variable) %>% spread(temp, value)


but it gives following error:


Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6), (13, 16), (14, 17), (15, 18), (7, 10), (8, 11), (9, 12), (19, 22), (20, 23), (21, 24)


Thanks in advance.
Note: I don't want to delete multiple entries.

Answer

The issue is the two columns for both A and B. If we can make that one value column, we can spread the data as you would like. Take a look at the output for jj_melt when you use the code below.

library(reshape2)
jj_melt <- melt(jj, id=c("month", "student"))
jj_spread <- dcast(jj_melt, month ~ student + variable, value.var="value", fun=sum)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1    17    11     8     8
# 2     2    13    13     8     5
# 3     3    15    15     6    11

I won't mark this as a duplicate since the other question did not summarize by sum, but the data.table answer could help with one additional argument, fun=sum:

library(data.table)
dcast(setDT(jj), month ~ student, value.var=c("A", "B"), fun=sum)
#    month A_sum_Amy A_sum_Bob B_sum_Amy B_sum_Bob
# 1:     1        17         8        11         8
# 2:     2        13         8        13         5
# 3:     3        15         6        15        11

If you would like to use the tidyr solution, combine it with dcast to summarize by sum.

as.data.frame(jj)
library(tidyr)
jj %>% 
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  dcast(month ~ temp, fun=sum)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1    17    11     8     8
# 2     2    13    13     8     5
# 3     3    15    15     6    11

Edit

Based on your new requirements, I have added an activity column.

library(dplyr)
jj %>% group_by(month, student) %>% 
  mutate(id=1:n()) %>%
  melt(id=c("month", "id", "student")) %>%
  dcast(... ~ student + variable, value.var="value")
#   month id Amy_A Amy_B Bob_A Bob_B
# 1     1  1     9     6     3     5
# 2     1  2     8     5     5     3
# 3     2  1     7     7     2     4
# 4     2  2     6     6     6     1
# 5     3  1     6     8     1     6
# 6     3  2     9     7     5     5

The other solutions can also be used. Here I added an optional expression to arrange the final output by activity number:

library(tidyr)
jj %>% 
  gather(variable, value, -(month:student)) %>%
  unite(temp, student, variable) %>%
  group_by(temp) %>%
  mutate(id=1:n()) %>%
  dcast(... ~ temp) %>%
  arrange(id)
#   month id Amy_A Amy_B Bob_A Bob_B
# 1     1  1     9     6     3     5
# 2     2  2     7     7     2     4
# 3     3  3     6     8     1     6
# 4     1  4     8     5     5     3
# 5     2  5     6     6     6     1
# 6     3  6     9     7     5     5

The data.table syntax is compact because it allows for multiple value.var columns and will take care of the spread for us. We can then skip the melt -> cast process.

library(data.table)
setDT(jj)[, activityID := rowid(student)]
dcast(jj, ... ~ student, value.var=c("A", "B"))
#    month activityID A_Amy A_Bob B_Amy B_Bob
# 1:     1          1     9     3     6     5
# 2:     1          4     8     5     5     3
# 3:     2          2     7     2     7     4
# 4:     2          5     6     6     6     1
# 5:     3          3     6     1     8     6
# 6:     3          6     9     5     7     5
Comments