user2962956 - 1 year ago 50

R Question

I have the following code who's aim is to take a single numeric data frame column and create a list st every two elements of the vector refer to the start and end index of the data-frame where the mean is over 0.032.

Example:

`Input: [0.012,0.02,0.032,0.045,0.026,0.06,0.01]`

Output [3,5,6,6]

as

`mean(input(3:5))>0.032`

`mean(input(6:6))>0.032`

Slightly more complex example

Input[0,0.08,0.08,0.031,0.031,-0.1]

Output [2,5]

So I can't just identify items above 0.032, and as far as I can see I need to loop over every index. (hence the while loop)

It runs very well for for "small data-frames" but I am trying to get it to run on data-frames with 2,000,000 rows, if not more.

My issue is that it runs very slowly when I get up to a large number of rows. Specifically it shoots through the values 0-100000 but slows dramatically afterwards

`activityduration<-function(input)`

{

datum<-as.matrix(input)

len=length(datum)

times <-c()

i<-1

while (i <len)

{

if (i>=len)

{

break

}

i<-i+1

if (datum[i]<0.032)

{

next

}

else

{

vect = c(datum[i])

x<-i

while ((mean(vect)>=0.032)){

print(i)

if (i==len)

{

break

}

i<-i+1

boolean <- TRUE

vect <- c(datum[x:i])

}

if (i==len)

{

break

}

if (boolean)

{

times <- c(times, c(x,i-1))

boolean<-FALSE

}

}

}

return(times)

}

What I assume is the issue:

I am constantly growing the vector

`vect`

`vect`

`vect's`

Fixes I have tried:

originally the input(a data-frame) was just accessed as a data-frame, I changed this to a matrix for a substantial speed increase.

I replaced else with:

`{`

newVal = c(datum[i])

x<-i

n<-0

meanValue<-0

while (((meanValue*n+newVal)>=(0.032*(n+1))){

print(i)

if (i==len)

{

break

}

meanValue<-(meanValue*n+newVal)/n+1

n<n+1

i<-i+1

}

Which removed the need for the vector while maintaining the same operation, however this cause an even greater slow down. most likely due to the massive number of operations performed.

I also tried: Initialising the vector

`vect`

`mean(vect)>=0.032`

`sum(vect)/n >=0.032`

`mean(vect[!vect==0])`

and this results in an even greater slowdown.

Does any one know how I can increase the speed?

Answer Source

Here is another algorithm to produce the results identical to what @Joseph Wood gets.

```
activityduration <- function(input, th) {
epsilon <- 2*.Machine$double.eps
a <- input - th
s <- 0; l <- 1; r <- 1; f <- F;
n <- length(input)
res <- vector(mode = "integer", length = 2 * n)
j <- 0
for (i in 1:n) {
s <- s + a[i]
if (s < 0 - epsilon) {
if (f) {
j <- j + 1
res[c(2 * j - 1, 2 * j)] <- c(l, r)
f <- F
} else {
l <- i + 1
}
s <- 0
} else {
r <- i
if (!f) {
f <- T
l <- i
}
}
}
if (f) {
j <- j + 1
res[c(2 * j - 1, 2 * j)] <- c(l, r)
}
return(res[res > 0])
}
```

**Tests on original examples**

```
print(activityduration(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032))
[1] 3 7
print(activityduration(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032))
[1] 2 5
```

**Tests on @Joseph Wood's data**

```
set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000
JoeTest <- VariableMean(Samp1, 0.032)
SomeTest <- activityduration(Samp1, 0.032)
all(JoeTest == SomeTest)
[1] TRUE
```

**Performance tests**

```
library("microbenchmark")
microbenchmark(Joseph=VariableMean(Samp1, 0.032), SomeAlgo=activityduration(Samp1, 0.032), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 38.94056 39.54052 40.59358 40.41387 41.83913 42.14377 10
SomeAlgo 38.14466 38.53188 39.47474 38.91653 40.24965 41.72669 10
microbenchmark(Joseph=VariableMean(Samp2, 0.032), SomeAlgo=activityduration(Samp2, 0.032), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 201.9639 212.5006 226.1548 217.6033 238.1169 266.1831 10
SomeAlgo 194.1691 200.7253 203.0191 203.6269 205.4802 211.1224 10
system.time(VariableMean(Samp3, 0.032))
user system elapsed
2.12 0.01 2.16
system.time(activityduration(Samp3, 0.032))
user system elapsed
2.08 0.02 2.10
```

**Discussions**

1. This algorithm has a speed gain, though very moderate;

2. The core of the algorithm is to avoid direct calculation of the mean, instead it calculates if the cumulative sum changes its sign.