skan - 6 months ago 39

R Question

I would like to know how many times each variable changes within each group and later add the result for all groups.

I've found this way:

`mi[,lapply(.SD, function(x) sum(x != shift(x),`

na.rm=T) ), by = ID][,-1][,lapply(.SD,sum, na.rm=T)]

It works, it produces the proper result but it's really slow in my large datatable.

I would like to do both operations inside the same lapply (or something faster and more compact), but the first one is done by group, the second isn't.

It could be written in an easier way (maybe not always)

`mi[,lapply(.SD, function(x) sum(x != shift(x),`

na.rm=T) )] [,-1]-mi[,length(unique(ID))]+1

But it's still slow and needs a lot of memory.

Any other idea?

I've also tried with diffs instead of shift, but it becomes more difficult.

Here you have a dummy example:

`mi <- data.table(ID=rep(1:3,each=4) , year=rep(1:4, times=3),`

VREP=rep(1:3,each=4) , VDI=rep(1:4, times=3), RAN=sample(12))

mi <- rbind(mi, data.table(4,1,1,1,0), use.names=F)

Big example for benchmark

`mi <- as.data.table(matrix(sample(0:100,10000000,`

replace=T), nrow=100000, ncol=100))

mi[,ID := rep(1:1000,each=100)]

My problem is that the true dataset is much bigger, it's in the limit of memory size, then I've configured R to be able to use more memory using the pagefile, and it makes many operations slow.

I know I could do it splitting the file and joining it again, but sometimes that makes things more difficult or some operations are not splittable.

Answer

Your second method produces incorrect results, so is not a fair comparison point. Here's an optimized version of alexis_laz's suggestion instead:

```
setorder(mi, ID)
setDT(Map(`!=`, mi, shift(mi)))[,
lapply(lapply(.SD, `&`, !ID), sum, na.rm = T), .SDcols = -"ID"]
# year VREP VDI RAN
#1: 9 0 9 9
```

On your bigger sample:

```
setorder(mi, ID)
microbenchmark(method1(), alexis_laz(), eddi(), times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# method1() 7336.1830 7510.9543 7932.0476 8150.3197 8207.2181 8455.563 5
# alexis_laz() 1350.0338 1492.3793 1509.0790 1492.5426 1577.3318 1633.107 5
# eddi() 400.3999 475.6908 494.5805 504.6163 524.2077 567.988 5
```