eflores89 eflores89 - 3 months ago 10
R Question

Universal function for difference between rows in tidy data?

I have roughly the following data.frame:

df <- structure(list(WK = c("W212015", "W212015", "W222015", "W222015",
"W232015", "W232015", "W372016", "W372016", "W382016", "W382016",
"W392016", "W392016"), YEAR = c(2015, 2015, 2015, 2015, 2015,
2015, 2016, 2016, 2016, 2016, 2016, 2016), SEM = c("W21", "W21",
"W22", "W22", "W23", "W23", "W37", "W37", "W38", "W38", "W39",
"W39"), DEPTO = c("FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA",
"NO FARMA", "FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA",
"NO FARMA"), UNIDADES = c(1010, 1252, 1354, 1275, 1411, 1171,
1057, 1362, 1435, 1120, 1451, 1140), VENTAS = c(128270, 172776,
173312, 204000, 189074, 165111, 130011, 226092, 231035, 181440,
220552, 197220)), .Names = c("WK", "YEAR", "SEM", "DEPTO", "UNIDADES",
"VENTAS"), class = "data.frame", row.names = c(NA, -12L))

head(df)
WK YEAR SEM DEPTO UNIDADES VENTAS
1 W212015 2015 W21 FARMA 1010 128270
2 W212015 2015 W21 NO FARMA 1252 172776
3 W222015 2015 W22 FARMA 1354 173312
4 W222015 2015 W22 NO FARMA 1275 204000
5 W232015 2015 W23 FARMA 1411 189074
6 W232015 2015 W23 NO FARMA 1171 165111


The data is in tidy format, as is usually recommended. But I keep bumping into some problems when I'm trying to calculate year-over-year changes (calculated as: t1/t0-1).

The desired outcome would be, something like this:

res <- structure(list(DEPTO = c("FARMA", "FARMA", "FARMA", "FARMA",
"FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA",
"FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA",
"FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA",
"NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA",
"NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA",
"NO FARMA", "NO FARMA"), SEM = c("W21", "W22", "W23", "W24",
"W25", "W26", "W27", "W28", "W29", "W30", "W31", "W32", "W33",
"W34", "W35", "W36", "W37", "W38", "W39", "W21", "W22", "W23",
"W24", "W25", "W26", "W27", "W28", "W29", "W30", "W31", "W32",
"W33", "W34", "W35", "W36", "W37", "W38", "W39"), YEAR = c(2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016), YOYDIFF = c(0.103960396039604, -0.231166912850812,
0.0276399716513112, -0.0818120351588911, 0.0528169014084507,
0.128721541155867, -0.105511811023622, -0.227333333333333, -0.103495544893763,
0.017304189435337, -0.0495652173913044, -0.239972807613868, -0.258199581297976,
-0.204038997214485, 0.281081081081081, -0.255927475592748, -0.0130718954248366,
-0.0231449965963241, 0.4169921875, 0.0511182108626198, -0.167058823529412,
-0.0341588385994877, 0.00649953574744666, 0.05859375, 0.10594512195122,
0.0839469808541973, 0.250877192982456, 0.233502538071066, 0.0282738095238095,
-0.0850077279752705, 0.103135313531353, 0.121684867394696, -0.0892448512585813,
0.137738853503185, -0.489478114478115, NA, -0.206236711552091,
-0.105180533751962)), .Names = c("DEPTO", "SEM", "YEAR", "YOYDIFF"
), row.names = c(NA, 38L), class = "data.frame")

head(res)
DEPTO SEM YEAR YOYDIFF
1 FARMA W21 2016 0.10396040
2 FARMA W22 2016 -0.23116691
3 FARMA W23 2016 0.02763997
4 FARMA W24 2016 -0.08181204
5 FARMA W25 2016 0.05281690
6 FARMA W26 2016 0.12872154


So, I did a year-over-year by week (SEM) and by a grouping variable (DEPTO).

Normally, I would do this with a combination of
dcast
or some filtering and
inner_joins
with
dplyr
, but I'm running into problems with speed and it really seems like it's too much inefficiency to be doing all of this for a simple calculation.

Is there any "universal" function or package to speed this up and make it more intuitive? I'm interested to see if anyone has tackled this using the super fast
data.table
or anything similar without going through various joins and operations.

UPDATE

The result (YOYDIFF) would be obtained for example by dividing the variable
UNIDADES where DEPTO="FARMA" and YEAR = "2016" and SEM = "W21" divided by UNIDADES where DEPTO="FARMA" and YEAR = "2015" and SEM = "W21" and finally subtracting 1.

UPDATE 2

The
df
is a very minimal example just to show structure and the
res
is the result of doing some of these calculations on an excel. However the original data is not all in the
df

Answer

Maybe that's not exactly what you're after but here's my guess:

library(data.table)
setDT(df)
df[,list(YOYDIFF = UNIDADES[YEAR==2016] / UNIDADES[YEAR==2015]), by=c('SEM','DEPTO')]

Which gives:

   SEM    DEPTO   YOYDIFF
1: W21    FARMA 1.0465347
2: W21 NO FARMA 1.0878594
3: W22    FARMA 1.0598227
4: W22 NO FARMA 0.8784314
5: W23    FARMA 1.0283487
6: W23 NO FARMA 0.9735269

Starting dataset (modified to be usable):

df <- structure(list(WK = c("W212015", "W212015", "W222015", "W222015", 
"W232015", "W232015", "W372016", "W372016", "W382016", "W382016", 
"W392016", "W392016"), YEAR = c(2015, 2015, 2015, 2015, 2015, 
2015, 2016, 2016, 2016, 2016, 2016, 2016), SEM = c("W21", "W21", 
"W22", "W22", "W23", "W23", "W21", "W21", "W22", "W22", "W23", 
"W23"), DEPTO = c("FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA", 
"NO FARMA", "FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA", 
"NO FARMA"), UNIDADES = c(1010, 1252, 1354, 1275, 1411, 1171, 
1057, 1362, 1435, 1120, 1451, 1140), VENTAS = c(128270, 172776, 
173312, 204000, 189074, 165111, 130011, 226092, 231035, 181440, 
220552, 197220)), .Names = c("WK", "YEAR", "SEM", "DEPTO", "UNIDADES", 
"VENTAS"), row.names = c(NA, -12L), class = "data.frame")