Question

Increasing the speed of a variable-length vector mean in R

I have the following code, whose aim is to take a single numeric data-frame column and build a vector such that every two elements give the start and end indices of a stretch of the data frame whose mean is over 0.032.

Example:

Input: c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
Output: c(3, 5, 6, 6)

since mean(input[3:5]) > 0.032 and mean(input[6:6]) > 0.032.

A slightly more complex example:

Input: c(0, 0.08, 0.08, 0.031, 0.031, -0.1)
Output: c(2, 5)

So I can't just flag the individual items above 0.032 (see the snippet below); as far as I can see I need to loop over every index, hence the while loop.
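
To illustrate (a quick check of my own on the first example), simple thresholding misses elements that only qualify as part of a window:

    input <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
    which(input > 0.032)
    # [1] 4 6  -- misses indices 3 and 5, which qualify via the window mean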

It runs very well for small data frames, but I am trying to get it to run on data frames with 2,000,000 rows, if not more.

My issue is that it runs very slowly once the number of rows gets large. Specifically, it shoots through the values 0-100000 but slows dramatically afterwards.

activityduration <- function(input) {
    datum <- as.matrix(input)
    len <- length(datum)
    times <- c()
    i <- 1
    while (i < len) {
        if (i >= len) {   # redundant: the loop condition already guarantees this
            break
        }
        i <- i + 1        # note: i is incremented before use, so datum[1] is never tested
        if (datum[i] < 0.032) {
            next
        } else {
            vect <- c(datum[i])
            x <- i
            # Grow the window while its mean stays at or above the threshold
            while (mean(vect) >= 0.032) {
                print(i)
                if (i == len) {
                    break
                }
                i <- i + 1
                boolean <- TRUE
                vect <- c(datum[x:i])   # rebuilt (and grown) on every pass
            }
            if (i == len) {
                break
            }
            if (boolean) {
                times <- c(times, c(x, i - 1))   # times also grows by concatenation
                boolean <- FALSE
            }
        }
    }
    return(times)
}


What I assume is the issue: I am constantly growing the vector vect inside the second while loop (in some of my data vect can reach length 10000). This means I am repeatedly resizing vect, which causes the slowdown.
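
To back this up, here is a minimal timing sketch (my own, with made-up sizes): growing a vector with c() copies the whole thing on every call, so total work is quadratic, while writing into a preallocated vector is linear.

    n <- 20000

    # Growing by concatenation: each c() call copies the whole vector
    system.time({
        grown <- c()
        for (i in 1:n) grown <- c(grown, i)
    })

    # Preallocating once and writing in place
    system.time({
        pre <- numeric(n)
        for (i in 1:n) pre[i] <- i
    })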

Fixes I have tried:

Originally the input (a data frame) was accessed directly as a data frame; I changed this to a matrix for a substantial speed increase.
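
For reference, a small sketch of the kind of difference this makes (my own, with dummy data); element access on a data frame goes through much more machinery than on a matrix:

    df <- data.frame(v = runif(100000))
    m <- as.matrix(df)

    system.time({ s <- 0; for (i in 1:100000) s <- s + df[i, 1] })  # data-frame indexing
    system.time({ s <- 0; for (i in 1:100000) s <- s + m[i] })      # matrix indexing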

I replaced the else block with:

{
    newVal <- datum[i]
    x <- i
    n <- 0
    meanValue <- 0
    # Running-mean test: mean >= 0.032 is equivalent to
    # meanValue * n + newVal >= 0.032 * (n + 1)
    while ((meanValue * n + newVal) >= (0.032 * (n + 1))) {
        print(i)
        if (i == len) {
            break
        }
        meanValue <- (meanValue * n + newVal) / (n + 1)
        n <- n + 1
        i <- i + 1
        newVal <- datum[i]
    }
}


This removed the need for the growing vector while performing the same test; however, it caused an even greater slowdown, most likely due to the sheer number of operations performed per element.
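
As a sanity check on that running-mean update (my own snippet), the incremental form can be verified against mean():

    v <- c(0.040, 0.010, 0.050)
    m <- 0
    for (n in 0:(length(v) - 1)) {
        m <- (m * n + v[n + 1]) / (n + 1)   # incremental mean update
    }
    all.equal(m, mean(v))
    # [1] TRUE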

I also tried initialising the vector vect with 700,000 elements so that it should never need to grow, but to do that I needed to change mean(vect) >= 0.032 to either sum(vect)/n >= 0.032 or mean(vect[!vect == 0]), and this resulted in an even greater slowdown.

Does anyone know how I can increase the speed?

Answer

Here is another algorithm that produces results identical to what @Joseph Wood gets.

activityduration <- function(input, th) {
    epsilon <- 2 * .Machine$double.eps
    a <- input - th                   # shift so "mean >= th" becomes "sum >= 0"
    s <- 0                            # running sum of the shifted values
    l <- 1; r <- 1                    # left/right ends of the current run
    f <- FALSE                        # TRUE while inside a qualifying run
    n <- length(input)
    res <- vector(mode = "integer", length = 2 * n)  # preallocated output
    j <- 0
    for (i in 1:n) {
        s <- s + a[i]
        if (s < -epsilon) {
            # Cumulative sum went negative: close the current run, if any
            if (f) {
                j <- j + 1
                res[c(2 * j - 1, 2 * j)] <- c(l, r)
                f <- FALSE
            } else {
                l <- i + 1
            }
            s <- 0                    # restart the sum after a failed stretch
        } else {
            r <- i                    # the run extends at least to i
            if (!f) {
                f <- TRUE
                l <- i
            }
        }
    }
    if (f) {                          # close a run that reaches the end
        j <- j + 1
        res[c(2 * j - 1, 2 * j)] <- c(l, r)
    }
    return(res[res > 0])              # drop the unused tail
}

Tests on original examples

print(activityduration(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032))
[1] 3 7
print(activityduration(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032))
[1] 2 5

Tests on @Joseph Wood's data (VariableMean is the function from @Joseph Wood's answer)

set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000


JoeTest <- VariableMean(Samp1, 0.032)
SomeTest <- activityduration(Samp1, 0.032)

all(JoeTest == SomeTest)
[1] TRUE

Performance tests

library("microbenchmark")
microbenchmark(Joseph=VariableMean(Samp1, 0.032), SomeAlgo=activityduration(Samp1, 0.032), times = 10)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
   Joseph 38.94056 39.54052 40.59358 40.41387 41.83913 42.14377    10
 SomeAlgo 38.14466 38.53188 39.47474 38.91653 40.24965 41.72669    10
microbenchmark(Joseph=VariableMean(Samp2, 0.032), SomeAlgo=activityduration(Samp2, 0.032), times = 10)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
   Joseph 201.9639 212.5006 226.1548 217.6033 238.1169 266.1831    10
 SomeAlgo 194.1691 200.7253 203.0191 203.6269 205.4802 211.1224    10

system.time(VariableMean(Samp3, 0.032))
   user  system elapsed 
   2.12    0.01    2.16 
system.time(activityduration(Samp3, 0.032))
   user  system elapsed 
   2.08    0.02    2.10 

Discussions

1. This algorithm gives a speed gain, though a very moderate one.
2. The core of the algorithm is to avoid computing the mean directly; instead it tracks whether the cumulative sum of input - th changes sign (see the check below).
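
To make point 2 concrete, here is a small check (mine, not part of the original answer) on the first example, where the function reports the run 3:7:

    x <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
    th <- 0.032
    mean(x[3:7]) >= th       # TRUE: the run reported above
    sum(x[3:7] - th) >= 0    # TRUE: the same test using only additions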