watchtower watchtower - 29 days ago 18
R Question

Pipe output of more than one variable using tidyr::map or dplyr

I am building upon my earlier question on SO at Pipe output of one data.frame to another using dplyr

I want to create six correlation matrices that would let me analyze evolution of correlation in $ spent and quantity sold in the last three years. In essence, I am looking for

2 X [3X3]
type list. So far, I am able to create
3X3
list using
tidyr::map()
by making separate calls for each
Product_Type
and
Quantity
, but I have been unsuccessful in doing in one vectorized call. As you will see below, there is a lot of redundancy in my code.

Here's my data:

dput(DFile_Gather)
structure(list(Order.ID = c(456, 567, 345, 567, 2345, 8910, 8910,
789, 678, 456, 345, 8910, 234, 1234, 456), Calendar.Year = c(2015,
2015, 2016, 2015, 2017, 2015, 2015, 2016, 2015, 2015, 2016, 2015,
2016, 2016, 2015), Product_Type = c("Insurance", "Insurance",
"Tire", "Tire", "Rental", "Insurance", "Servicing", "Truck",
"Tire", "Servicing", "Truck", "Rental", "Car", "Servicing", "Tire"
), Mexican_Pesos = c(35797.32, 1916.25, 19898.62, 0, 22548.314011,
686.88, 0, 0, 0, 0, 0, 0, 0, 0, 203276.65683), Quantity = c(0.845580721440663,
0.246177053792905, 2.10266268677851, 1.89588258358317, 0.00223077008050406,
0.454640961140588, 1.92032156606277, 0.475872861771994, 0.587966920885798,
0.721024745664671, 0.696609684682582, 0.0441522564791413, 0.872232778060772,
0.343347997825813, 0.716224049425646)), .Names = c("Order.ID",
"Calendar.Year", "Product_Type", "Mexican_Pesos", "Quantity"), row.names = c(54L,
55L, 13L, 15L, 50L, 58L, 28L, 37L, 16L, 24L, 33L, 48L, 2L, 29L,
14L), class = "data.frame")


Here's my code for the first iteration : i.e. calculate correlation matrix for Product_Type

DFile_Spread_PType <- spread(DFile_Gather[-length(DFile_Gather)],key = Product_Type, value = Mexican_Pesos)

DFile<-DFile_Spread_PType
CYear <- unique(DFile$Calendar.Year)
DFile_Corr_PType <- purrr::map(CYear, ~ dplyr::filter(DFile, Calendar.Year == .)) %>%
purrr::map(~ cor(.[,colnames(DFile)[3:length(colnames(DFile))]]) ) %>%
structure(., names = CYear)


Finally, here's my code for the second iteration for Correlation Matrix by Quantity:

DFile_Spread_Qty <- spread(subset( DFile_Gather, select = -Mexican_Pesos),key = Product_Type, value = Quantity)
DFile<-DFile_Spread_Qty
DFile_Corr_Qty <- purrr::map(CYear, ~ dplyr::filter(DFile, Calendar.Year == .)) %>%
purrr::map(~ cor(.[,colnames(DFile)[3:length(colnames(DFile))]]) ) %>%
structure(., names = CYear)


As you can see above, there is too much redundancy, and the code looks really clunky.
I would sincerely appreciate if someone could help me out. I am specifically looking for two things:

1) do what I am doing above by not having any redundancy

2) If possible, get a list of 2X3X3 i.e.
Quantity
and
Product_Type
at top level, and then 3x3 correlation matrices referenced to each of the above.

I searched similar topics on SO, but I don't think there was any thread on similar topics.

Thanks in advance.

Answer

The following has no redundancy and uses no packages. Make Product_Type a factor and then split by year giving the list of years s. Now use a double Map over s and Values converting to wide form on each inner iteration using tapply and running cor.

DG <- transform(DFile_Gather, Product_Type = factor(Product_Type))
s <- split(DG, DG$Calendar.Year)
Values <- c("Mexican_Pesos", "Quantity")
By <- c("Order.ID", "Product_Type")
res <- Map(function(v) Map(function(s) cor(tapply(s[, v], s[By], c)), s), Values)
Comments