lll lll - 3 months ago 7
R Question

R: how to paste a small dataframe into a larger one with specific pattern

I have a data like the following:

DataA:

ID PCODE age COLOR POLISH ...
1 S 30 PINK GOLD
1 S 30 PINK GOLD
1 S 30 PINK GOLD
2 S 20 PINK RODIUM
2 S 20 PINK RODIUM
2 S 20 PINK RODIUM
...


and I have another data set containing attributes of the different products available (DataB) and I want to pass those values into the non-first row for each customer.

PCODE COLOR POLISH
S WHITE GOLD
S PINK GOLD
S PINK RODIUM


so basically for each customer and in each purchase occasion, I want to make the non-first row containing the attributes of products in dataB that has not been chosen by the customer. To be specific, the ideal result I want to get is

ID PCODE age COLOR POLISH ...
1 S 30 PINK GOLD
1 S 30 WHITE GOLD
1 S 30 PINK RODIUM
2 S 20 PINK RODIUM
2 S 20 PINK GOLD
2 S 20 WHITE GOLD


I have thought of using merge but it seems that this function will not give me the result I want, so I am not sure what I should use to achieve it.

Answer

Using a few steps, I think we can accomplish this using the plyr and dplyr packages. Note that you should load plyr prior to dplyr!

library(plyr)
library(dplyr)

# select first occurrence by ID and PCODE
# assumes your data is pre-sorted

table_A %>%
  group_by(ID, PCODE) %>%
  mutate(r_n = row_number()) %>%
  filter(r_n == 1) %>%
  select(-r_n) -> table_A1

# products not chosen by customer
# select only rows where all entries are not equal for each ID and PCODE

table_B1 <- 
  ddply(table_A1, c('ID', 'PCODE'),
      function(x) 
        table_B[sapply(1:nrow(table_B), 
                       function(y) !all(x[,names(table_B)] == table_B[y,])),])

# bind together after merging

rbind.data.frame(table_A1,
                 merge(table_A1[,c('ID','PCODE','age')], 
                       table_B1, 
                       by = c('ID', 'PCODE'))) %>%
  arrange(ID, PCODE) -> table_AB

# table_AB

#      ID PCODE   age COLOR POLISH
#   <int> <chr> <int> <chr>  <chr>
# 1     1     S    30  PINK   GOLD
# 2     1     S    30 WHITE   GOLD
# 3     1     S    30  PINK RODIUM
# 4     2     S    20  PINK RODIUM
# 5     2     S    20 WHITE   GOLD
# 6     2     S    20  PINK   GOLD

data

table_A <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), 
               PCODE = c("S", "S", "S", "S", "S", "S"), 
               age = c(30L, 30L, 30L, 20L, 20L, 20L), 
               COLOR = c("PINK", "PINK", "PINK", "PINK", "PINK", "PINK"), 
               POLISH = c("GOLD", "GOLD", "GOLD", "RODIUM", "RODIUM", "RODIUM")), 
          .Names = c("ID", "PCODE", "age", "COLOR", "POLISH"), class = "data.frame", 
          row.names = c(NA, -6L))

table_B <- structure(list(PCODE = c("S", "S", "S"), 
                          COLOR = c("WHITE", "PINK", "PINK"), 
                          POLISH = c("GOLD", "GOLD", "RODIUM")), 
                     .Names = c("PCODE", "COLOR", "POLISH"), 
                     class = "data.frame", row.names = c(NA, -3L))