watchtower - 10 months ago 34
R Question

# Sum values using dplyr in R for all combinations of variables

I am relatively new to R programming, so I apologize if this question is too basic. I have transactions that show revenue earned from six different types of products, covering three years. My objective is to find the sum of products sold for every combination of products, which would be `2^6 - 1 = 64 - 1 = 63` combinations for every year. That means I would have `63*3 = 189` combinations in total.
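The count can be checked with a short base R sketch. It assumes the six product names that appear in the random sample further down (`Car`, `Tire`, `Services`, `Insurance`, `Accessories`, `Finance`); the variable names `patterns`, `n_combinations`, and `n_total` are only illustrative.

```r
# Product names as they appear in the random sample in this question.
products <- c("Car", "Tire", "Services", "Insurance", "Accessories", "Finance")

# Every in/out pattern across the six products: 2^6 = 64 rows.
patterns <- expand.grid(rep(list(c(FALSE, TRUE)), length(products)))
names(patterns) <- products

# Drop the single all-FALSE pattern (an order containing no product at all).
patterns <- patterns[rowSums(patterns) > 0, ]

n_combinations <- nrow(patterns)   # combinations per year
n_total <- n_combinations * 3      # combinations across three fiscal years
```

Not every pattern has to occur in the data, so the number of rows in an actual summary can be smaller than this upper bound.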

For the sake of simplicity, I have created test data using only three variables, because I wrote a program using a `while` loop for one year, which stinks. It is only meant to show what I am trying to accomplish. I have also posted a random sample from my original file below.

Here's the test data with only three variables, `Car`, `Tire`, and `Services`, and the `while` loop to show you what I am looking for:

``````
dput(Sample_File)
structure(list(Order.ID = c(171, 173, 132, 174, 132, 174, 132,
174, 174), Fiscal.Year = c(2017, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2018), Car = c(2, 2, 3, 1, 0, 0, 0, 0, 1), Tire = c(0,
0, 0, 1, 0, 1, 0, 1, 1), Services = c(3, 1, 4, 0, 4, 1, 4, 0,
0)), .Names = c("Order.ID", "Fiscal.Year", "Car", "Tire", "Services"
), row.names = c(NA, 9L), class = "data.frame")
``````

Here's my code:

``````
i <- 1
Csum <- matrix(rep(0, 21), nrow = 7, ncol = 3)
# Row 1 is used when C is ON; T is ON ; S is ON
# Row 2 is used when C is ON; T is ON ; S is OFF
# Row 3 is used when C is ON; T is OFF ; S is ON
# Row 4 is used when C is OFF; T is ON ; S is ON
# Row 5 is used when C is ON; T is OFF ; S is OFF
# Row 6 is used when C is OFF; T is ON ; S is OFF
# Row 7 is used when C is OFF; T is OFF ; S is ON

while (i <= length(Sample_File$Order.ID)) {
  if (Sample_File$Fiscal.Year[i] != 2016) {
    i <- i + 1
    next
  }
  if (Sample_File$Car[i] != 0 & Sample_File$Tire[i] != 0 & Sample_File$Services[i] != 0) { # 1
    Csum[1, 1] <- Csum[1, 1] + Sample_File$Car[i]
    Csum[1, 2] <- Csum[1, 2] + Sample_File$Tire[i]
    Csum[1, 3] <- Csum[1, 3] + Sample_File$Services[i]
  } else if (Sample_File$Car[i] != 0 & Sample_File$Tire[i] != 0 & Sample_File$Services[i] == 0) { # 2
    Csum[2, 1] <- Csum[2, 1] + Sample_File$Car[i]
    Csum[2, 2] <- Csum[2, 2] + Sample_File$Tire[i]
    Csum[2, 3] <- Csum[2, 3] + 0
  } else if (Sample_File$Car[i] != 0 & Sample_File$Tire[i] == 0 & Sample_File$Services[i] != 0) { # 3
    Csum[3, 1] <- Csum[3, 1] + Sample_File$Car[i]
    Csum[3, 2] <- Csum[3, 2] + 0
    Csum[3, 3] <- Csum[3, 3] + Sample_File$Services[i]
  } else if (Sample_File$Car[i] == 0 & Sample_File$Tire[i] != 0 & Sample_File$Services[i] != 0) { # 4
    Csum[4, 1] <- Csum[4, 1] + 0
    Csum[4, 2] <- Csum[4, 2] + Sample_File$Tire[i]
    Csum[4, 3] <- Csum[4, 3] + Sample_File$Services[i]
  } else if (Sample_File$Car[i] != 0 & Sample_File$Tire[i] == 0 & Sample_File$Services[i] == 0) { # 5
    Csum[5, 1] <- Csum[5, 1] + Sample_File$Car[i]
    Csum[5, 2] <- Csum[5, 2] + 0
    Csum[5, 3] <- Csum[5, 3] + 0
  } else if (Sample_File$Car[i] == 0 & Sample_File$Tire[i] != 0 & Sample_File$Services[i] == 0) { # 6
    Csum[6, 1] <- Csum[6, 1] + 0
    Csum[6, 2] <- Csum[6, 2] + Sample_File$Tire[i]
    Csum[6, 3] <- Csum[6, 3] + 0
  } else { # 7
    Csum[7, 1] <- Csum[7, 1] + 0
    Csum[7, 2] <- Csum[7, 2] + 0
    Csum[7, 3] <- Csum[7, 3] + Sample_File$Services[i]
  }
  i <- i + 1
}
``````

I have written the code to handle only one year because it was extremely painful to replicate this code for three years. I am looking for a solution that would create a list of three data frames, one for each of the three years.
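As a point of comparison, the one-year tally done by the loop can be sketched in base R without an explicit loop. This is only an illustration under a few assumptions: the names `one_year`, `is_on`, `pattern`, and `Csum_vec` are hypothetical, the pattern labels such as `"ON/OFF/ON"` stand for Car/Tire/Services, and only patterns that actually occur in the data produce a row, so the result can have fewer than seven rows.

```r
# Test data copied from the question.
Sample_File <- data.frame(
  Order.ID    = c(171, 173, 132, 174, 132, 174, 132, 174, 174),
  Fiscal.Year = c(2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2018),
  Car         = c(2, 2, 3, 1, 0, 0, 0, 0, 1),
  Tire        = c(0, 0, 0, 1, 0, 1, 0, 1, 1),
  Services    = c(3, 1, 4, 0, 4, 1, 4, 0, 0)
)

products <- c("Car", "Tire", "Services")

# Keep a single fiscal year, as the loop does.
one_year <- Sample_File[Sample_File$Fiscal.Year == 2016, ]

# Label every row by its ON/OFF pattern in Car/Tire/Services order.
is_on <- one_year[products] != 0
pattern <- apply(ifelse(is_on, "ON", "OFF"), 1, paste, collapse = "/")

# rowsum() adds up each product column within every pattern label.
Csum_vec <- rowsum(as.matrix(one_year[products]), group = pattern)
```

Unlike the loop, a row where all three products are zero would get its own `"OFF/OFF/OFF"` label here instead of falling into case 7; the test data contains no such row.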

Here's a random sample of size 10 with six variables from original file.

``````
dput(Sample_File_Random)
structure(list(Order.ID = c(171, 173, 132, 174, 169, 175, 163,
186, 178, 121), Fiscal.Year = c(2016, 2016, 2017, 2016, 2015,
2016, 2015, 2015, 2015, 2017), Car = c(2, 0, 3, 0, 0, 0, 0, 5346.25,
0, 0), Tire = c(0, 0, 0, 8691.55800460666, 3198, 5, 2, 0, 2,
3282.18), Services = c(3, 0, 4, 0, 0, 0, 0, 0, 0, 0), Insurance = c(4,
0, 0, 4, 0, 4, 0, 0, 0, 0), Accessories = c(94.3, 3749.8, 9308.65,
0, 2, 0, 1, 633.75, 51.44, 0), Finance = c(0, 0, 0, 4, 0, 14800,
0, 0, 0, 0)), .Names = c("Order.ID", "Fiscal.Year", "Car", "Tire",
"Services", "Insurance", "Accessories", "Finance"), row.names = c(NA,
10L), class = "data.frame")
``````

I am really stuck, so I would sincerely appreciate any help with vectorizing this.

At @Ronak Shah's request, here's the expected output for `Sample_File_Random`:

``````
Output_File
  Fiscal.Year     Car     Tire Services Insurance Accessories Finance
1        2015    0.00 3202.000        0         0       54.44       0
2        2015 5346.25    0.000        0         0      633.75       0
3        2016    2.00    0.000        3         4       94.30       0
4        2016    0.00    0.000        0         0     3749.80       0
5        2016    0.00 8696.558        0         8        0.00   14804
6        2017    3.00    0.000        4         0     9308.65       0
7        2017    0.00 3282.180        0         0        0.00       0
``````

Here is a compact & expressive `dplyr` solution, which proceeds in three steps:

1. create the indicators for whether each of the products is in the basket or not
2. group by the year and the combinations of the indicators
3. sum the product values by the grouping variables

Here is the code that does this:

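``````
library(dplyr)

# df_foo is the input data frame (the question's Sample_File gives the output below).
# Note: mutate_each(), funs(), group_by_() and summarise_each() are deprecated
# in later dplyr releases.
df_foo %>%
  # 1. create the indicators of whether each of the
  #    products is in the basket or not
  mutate_each(
    funs(In_Basket = . > 0), Car:Services
  ) %>%
  # 2. group by the year and the basket indicators
  group_by_(.dots = c("Fiscal.Year", grep("_In_Basket", names(.), value = TRUE))) %>%
  # 3. sum the product values
  summarise_each(
    funs(sum(., na.rm = TRUE)), Car:Services
  )
``````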

This gives the output:

``````
Source: local data frame [7 x 7]

  Fiscal.Year Car_In_Basket Tire_In_Basket Services_In_Basket   Car  Tire Services
        <dbl>         <lgl>          <lgl>              <lgl> <dbl> <dbl>    <dbl>
1        2016         FALSE          FALSE               TRUE     0     0        8
2        2016         FALSE           TRUE              FALSE     0     1        0
3        2016         FALSE           TRUE               TRUE     0     1        1
4        2016          TRUE          FALSE               TRUE     5     0        5
5        2016          TRUE           TRUE              FALSE     1     1        0
6        2017          TRUE          FALSE               TRUE     2     0        3
7        2018          TRUE           TRUE              FALSE     1     1        0
``````