eflores89 - 1 year ago 58
R Question

Universal function for difference between rows in tidy data?

I have roughly the following data.frame:

``````df <- structure(list(WK = c("W212015", "W212015", "W222015", "W222015",
"W232015", "W232015", "W372016", "W372016", "W382016", "W382016",
"W392016", "W392016"), YEAR = c(2015, 2015, 2015, 2015, 2015,
2015, 2016, 2016, 2016, 2016, 2016, 2016), SEM = c("W21", "W21",
"W22", "W22", "W23", "W23", "W37", "W37", "W38", "W38", "W39",
"W39"), DEPTO = c("FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA",
"NO FARMA", "FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA",
"NO FARMA"), UNIDADES = c(1010, 1252, 1354, 1275, 1411, 1171,
1057, 1362, 1435, 1120, 1451, 1140), VENTAS = c(128270, 172776,
173312, 204000, 189074, 165111, 130011, 226092, 231035, 181440,
220552, 197220)), .Names = c("WK", "YEAR", "SEM", "DEPTO", "UNIDADES",
"VENTAS"), class = "data.frame", row.names = c(NA, -12L))

WK      YEAR SEM    DEPTO UNIDADES VENTAS
1 W212015 2015 W21    FARMA     1010 128270
2 W212015 2015 W21 NO FARMA     1252 172776
3 W222015 2015 W22    FARMA     1354 173312
4 W222015 2015 W22 NO FARMA     1275 204000
5 W232015 2015 W23    FARMA     1411 189074
6 W232015 2015 W23 NO FARMA     1171 165111
``````

The data is in tidy format, as is usually recommended. But I keep bumping into some problems when I'm trying to calculate year-over-year changes (calculated as: t1/t0-1).

The desired outcome would be, something like this:

``````res <- structure(list(DEPTO = c("FARMA", "FARMA", "FARMA", "FARMA",
"FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA",
"FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA", "FARMA",
"FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA",
"NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA",
"NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA", "NO FARMA",
"NO FARMA", "NO FARMA"), SEM = c("W21", "W22", "W23", "W24",
"W25", "W26", "W27", "W28", "W29", "W30", "W31", "W32", "W33",
"W34", "W35", "W36", "W37", "W38", "W39", "W21", "W22", "W23",
"W24", "W25", "W26", "W27", "W28", "W29", "W30", "W31", "W32",
"W33", "W34", "W35", "W36", "W37", "W38", "W39"), YEAR = c(2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016), YOYDIFF = c(0.103960396039604, -0.231166912850812,
0.0276399716513112, -0.0818120351588911, 0.0528169014084507,
0.128721541155867, -0.105511811023622, -0.227333333333333, -0.103495544893763,
0.017304189435337, -0.0495652173913044, -0.239972807613868, -0.258199581297976,
-0.204038997214485, 0.281081081081081, -0.255927475592748, -0.0130718954248366,
-0.0231449965963241, 0.4169921875, 0.0511182108626198, -0.167058823529412,
-0.0341588385994877, 0.00649953574744666, 0.05859375, 0.10594512195122,
0.0839469808541973, 0.250877192982456, 0.233502538071066, 0.0282738095238095,
-0.0850077279752705, 0.103135313531353, 0.121684867394696, -0.0892448512585813,
0.137738853503185, -0.489478114478115, NA, -0.206236711552091,
-0.105180533751962)), .Names = c("DEPTO", "SEM", "YEAR", "YOYDIFF"
), row.names = c(NA, 38L), class = "data.frame")

DEPTO SEM YEAR     YOYDIFF
1 FARMA W21 2016  0.10396040
2 FARMA W22 2016 -0.23116691
3 FARMA W23 2016  0.02763997
4 FARMA W24 2016 -0.08181204
5 FARMA W25 2016  0.05281690
6 FARMA W26 2016  0.12872154
``````

So, I did a year-over-year by week (SEM) and by a grouping variable (DEPTO).

Normally, I would do this with a combination of
`dcast`
or some filtering and
`inner_joins`
with
`dplyr`
, but I'm running into problems with speed and it really seems like it's too much inefficiency to be doing all of this for a simple calculation.

Is there any "universal" function or package to speed this up and make it more intuitive? I'm interested to see if anyone has tackled this using the super fast
`data.table`
or anything similar without going through various joins and operations.

UPDATE

The result (YOYDIFF) would be obtained for example by dividing the variable
UNIDADES where DEPTO="FARMA" and YEAR = "2016" and SEM = "W21" divided by UNIDADES where DEPTO="FARMA" and YEAR = "2015" and SEM = "W21" and finally subtracting 1.

UPDATE 2

The
`df`
is a very minimal example just to show structure and the
`res`
is the result of doing some of these calculations on an excel. However the original data is not all in the
`df`

Maybe that's not exactly what you're after but here's my guess:

``````library(data.table)
setDT(df)
``````

Which gives:

``````   SEM    DEPTO   YOYDIFF
1: W21    FARMA 1.0465347
2: W21 NO FARMA 1.0878594
3: W22    FARMA 1.0598227
4: W22 NO FARMA 0.8784314
5: W23    FARMA 1.0283487
6: W23 NO FARMA 0.9735269
``````

Starting dataset (modified to be usable):

``````df <- structure(list(WK = c("W212015", "W212015", "W222015", "W222015",
"W232015", "W232015", "W372016", "W372016", "W382016", "W382016",
"W392016", "W392016"), YEAR = c(2015, 2015, 2015, 2015, 2015,
2015, 2016, 2016, 2016, 2016, 2016, 2016), SEM = c("W21", "W21",
"W22", "W22", "W23", "W23", "W21", "W21", "W22", "W22", "W23",
"W23"), DEPTO = c("FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA",
"NO FARMA", "FARMA", "NO FARMA", "FARMA", "NO FARMA", "FARMA",
"NO FARMA"), UNIDADES = c(1010, 1252, 1354, 1275, 1411, 1171,
1057, 1362, 1435, 1120, 1451, 1140), VENTAS = c(128270, 172776,
173312, 204000, 189074, 165111, 130011, 226092, 231035, 181440,
220552, 197220)), .Names = c("WK", "YEAR", "SEM", "DEPTO", "UNIDADES",
"VENTAS"), row.names = c(NA, -12L), class = "data.frame")
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download