user2962956 - 2 years ago

# Increasing speed of variable length vector mean R

I have the following code, whose aim is to take a single numeric data-frame column and produce a vector in which every pair of elements gives the start and end indices of a stretch of the data whose mean is over 0.032.

Example:

``````Input: [0.012,0.02,0.032,0.045,0.026,0.06,0.01]
Output [3,5,6,6]
``````

as
`mean(input[3:5]) > 0.032`
and
`mean(input[6:6]) > 0.032`

Slightly more complex example:

``````Input: [0,0.08,0.08,0.031,0.031,-0.1]
Output [2,5]
``````

So I can't just identify items above 0.032, and as far as I can see I need to loop over every index (hence the while loop).
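To see why simple thresholding fails, a quick check on the first example's data:

```r
input <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)

# Indices individually at or above the threshold: 3, 4 and 6.
which(input >= 0.032)

# But index 5 (0.026) still belongs to a qualifying run, because the
# mean of the run 3:5 stays above the threshold.
mean(input[3:5]) > 0.032   # TRUE
```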

It runs very well for small data-frames, but I am trying to get it to run on data-frames with 2,000,000 rows, if not more.

My issue is that it runs very slowly once the data gets large. Specifically, it shoots through the indices 0-100,000 but slows dramatically afterwards.

``````activityduration <- function(input)
{
  datum <- as.matrix(input)
  len <- length(datum)
  times <- c()
  boolean <- FALSE
  i <- 1
  while (i < len)
  {
    i <- i + 1
    if (datum[i] < 0.032)
    {
      next
    }
    else
    {
      # start of a candidate run at index x
      vect <- c(datum[i])
      x <- i
      while (mean(vect) >= 0.032) {
        print(i)
        if (i == len)
        {
          break
        }
        i <- i + 1
        boolean <- TRUE
        vect <- c(datum[x:i])   # the run grows one element at a time
      }
      if (i == len)
      {
        break
      }
      if (boolean)
      {
        times <- c(times, c(x, i - 1))   # record start and end of the run
        boolean <- FALSE
      }
    }
  }
  return(times)
}
``````

What I assume is the issue: I am constantly growing the vector `vect` inside the second while loop (in some of my data `vect` can reach length 10,000). This means I am repeatedly resizing `vect`, which causes the slowdown.
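A minimal illustration of that growth cost (the `grow`/`prealloc` helpers here are hypothetical, not from the question):

```r
# grow() rebuilds the vector on every iteration, so producing n elements
# costs O(n^2) copies; prealloc() allocates once and writes in place.
grow <- function(n) {
  v <- c()
  for (i in seq_len(n)) v <- c(v, i)   # full copy of v each time
  v
}
prealloc <- function(n) {
  v <- numeric(n)                      # allocate once up front
  for (i in seq_len(n)) v[i] <- i     # in-place write
  v
}
all(grow(1000) == prealloc(1000))      # same result, very different cost
```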

Fixes I have tried:
Originally the input (a data-frame) was accessed directly as a data-frame; I changed this to a matrix for a substantial speed increase.

I replaced else with:

``````{
  newVal <- datum[i]
  x <- i
  n <- 0
  meanValue <- 0
  # mean(vect) >= 0.032 rewritten as an incremental update,
  # avoiding the vector entirely
  while ((meanValue * n + newVal) >= (0.032 * (n + 1))) {
    print(i)
    if (i == len)
    {
      break
    }
    meanValue <- (meanValue * n + newVal) / (n + 1)
    n <- n + 1
    i <- i + 1
    newVal <- datum[i]
  }
}

This removed the need for the vector while maintaining the same operation; however, it caused an even greater slowdown, most likely due to the larger number of operations performed per element.

I also tried initialising the vector `vect` with 700,000 elements so that it should never need to grow, but to do that I needed to change `mean(vect) >= 0.032` to either `sum(vect)/n >= 0.032` or `mean(vect[!vect == 0])`, and this results in an even greater slowdown.
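As an aside, the mean test itself can be done without division by comparing sums, which also sidesteps the zero-padding problem a preallocated `vect` introduces (a sketch, not the question's exact code):

```r
v <- c(0.032, 0.045, 0.026)   # the run 3:5 from the first example
th <- 0.032

# mean(v) >= th  is equivalent to  sum(v) >= th * length(v):
# no division, and unused slots simply would not enter the sum.
identical(mean(v) >= th, sum(v) >= th * length(v))   # TRUE
```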

Does anyone know how I can increase the speed?

## Answer (foehn)

Here is another algorithm that produces results identical to what @Joseph Wood gets.

``````activityduration <- function(input, th) {
  epsilon <- 2 * .Machine$double.eps
  a <- input - th                  # work with deviations from the threshold
  s <- 0; l <- 1; r <- 1; f <- FALSE
  n <- length(input)
  res <- vector(mode = "integer", length = 2 * n)
  j <- 0
  for (i in 1:n) {
    s <- s + a[i]                  # running sum of deviations
    if (s < 0 - epsilon) {         # run broken: mean fell below th
      if (f) {
        j <- j + 1
        res[c(2 * j - 1, 2 * j)] <- c(l, r)
        f <- FALSE
      } else {
        l <- i + 1
      }
      s <- 0
    } else {
      r <- i
      if (!f) {                    # a new qualifying run starts here
        f <- TRUE
        l <- i
      }
    }
  }
  if (f) {                         # close a run still open at the end
    j <- j + 1
    res[c(2 * j - 1, 2 * j)] <- c(l, r)
  }
  return(res[res > 0])
}

Tests on original examples

``````print(activityduration(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032))
 3 7
print(activityduration(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032))
 2 5
``````

Tests on @Joseph Wood's data

``````set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000

JoeTest <- VariableMean(Samp1, 0.032)
SomeTest <- activityduration(Samp1, 0.032)

all(JoeTest == SomeTest)
 TRUE
``````

Performance tests

``````library("microbenchmark")
microbenchmark(Joseph=VariableMean(Samp1, 0.032), SomeAlgo=activityduration(Samp1, 0.032), times = 10)
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
Joseph 38.94056 39.54052 40.59358 40.41387 41.83913 42.14377    10
SomeAlgo 38.14466 38.53188 39.47474 38.91653 40.24965 41.72669    10
microbenchmark(Joseph=VariableMean(Samp2, 0.032), SomeAlgo=activityduration(Samp2, 0.032), times = 10)
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
Joseph 201.9639 212.5006 226.1548 217.6033 238.1169 266.1831    10
SomeAlgo 194.1691 200.7253 203.0191 203.6269 205.4802 211.1224    10

system.time(VariableMean(Samp3, 0.032))
user  system elapsed
2.12    0.01    2.16
system.time(activityduration(Samp3, 0.032))
user  system elapsed
2.08    0.02    2.10
``````

Discussion
1. This algorithm gives a speed gain, though a very moderate one;
2. The core of the algorithm is to avoid computing the mean directly: it tracks the running sum of `input - th` and checks whether that sum turns negative, since `mean(x) >= th` is equivalent to `sum(x - th) >= 0`.
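That equivalence can be traced by hand on the first example; the loop below is a sketch of the running sum only, without the full `l`/`r` index bookkeeping of the answer above:

```r
input <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
th <- 0.032

# sum(x - th) >= 0  <=>  mean(x) >= th, so a single running sum that
# resets when it goes negative replaces every mean() call.
s <- 0
for (i in seq_along(input)) {
  s <- s + (input[i] - th)
  if (s < 0) s <- 0   # run broken: restart the candidate run
}
# The sum stays non-negative from index 3 through 7, matching the
# reported output 3 7.
```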
