Clarinetist - 1 year ago 76
R Question

# Finding the number of values above a value and less than a value in a df column without using a loop

Consider the following:

``````
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))

count_above <- function(vector)
{
  counts <- vector()
  counts[1] <- 0
  for (i in 2:length(vector))
  {
    temp <- vector[1:i]
    counts <- c(counts, sum(temp < vector[i]))
  }
  return(counts)
}
``````

This gives me the correct output:

``````
count_above(df$X)
[1] 0 1 1 0 2
``````

For instance, the (column) vector here is

``````
5000
6000
5500
5000
5300
``````

At the very top, `5000`, there are no values above it, so this gives the value `0`.

At the `6000`, there is one value which is above it and is less than `6000`: the `5000`. So this gives the value `1`.

At the `5500`, there are two values above it, one of which is less than `5500`, so this gives the value `1`, and so forth.

Is there any way I can write this out without using a loop?

Another approach, quite similar to aichao's solution (but a bit shorter):

``````
X <- c(5000, 6000, 5500, 5000, 5300)
indices <- 1:length(X)
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
## [1] 0 1 1 0 2
``````
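To see why the one-liner above works, it may help to build the two logical matrices separately. The following is only an illustrative sketch on the example vector; the names `idx`, `value_less`, `comes_before` and `both` are introduced here for explanation and are not part of the original solution.

```r
X <- c(5000, 6000, 5500, 5000, 5300)
idx <- seq_along(X)

# value_less[i, j] is TRUE when X[i] < X[j]
value_less <- outer(X, X, "<")

# comes_before[i, j] is TRUE when position i precedes position j
comes_before <- outer(idx, idx, "<")

# both[i, j] is TRUE when X[i] is an earlier element that is smaller than X[j],
# so the sum of column j is the count wanted for element j
both <- value_less & comes_before
colSums(both)
## [1] 0 1 1 0 2
```

Each column of `both` plays the role of one iteration of the loop in the question: column `j` marks exactly those earlier elements that are smaller than `X[j]`.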

Edit (Performance): Perhaps my idea was selected as the accepted answer because the code is short and self-explanatory, but be careful about using it on large vectors! It is the slowest of all the solutions suggested here. Similar to what dracodoc did, I also ran a microbenchmark, but I used a randomly generated vector of 3000 values to get more meaningful run times:

``````
count_above_loop <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}

count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}

X <- runif(3000)

microbenchmark::microbenchmark(count_above_loop(X),
                               count_above_apply(X),
                               count_above_outer(X), times = 10)

Unit: milliseconds
                 expr       min        lq      mean    median        uq       max neval cld
  count_above_loop(X)  56.27923  58.17195  62.07571  60.08123  63.92010  77.31658    10  a
 count_above_apply(X)  54.41776  55.07511  57.12006  57.22372  58.61982  59.95037    10  a
 count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419    10   b
``````

We see that the apply approach on a large vector and without the overhead of a data frame is slightly faster than the for-loop.

My outer product approach takes more than double the time.
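A plausible explanation, offered as an assumption rather than something the benchmark itself measures, is that the outer product approach simply does more work: it evaluates every pair of elements twice and builds full n-by-n matrices, whereas the loop only compares each element with its predecessors. A rough sketch of that arithmetic for `n = 3000` (the variable names are made up for illustration):

```r
n <- 3000

# The loop compares element i with its i - 1 predecessors
loop_comparisons <- n * (n - 1) / 2

# Each of the two outer() calls evaluates all n * n pairs
outer_comparisons <- 2 * n^2

outer_comparisons / loop_comparisons  # roughly 4

# Each outer() call also materialises a full n-by-n logical matrix
m <- outer(seq_len(n), seq_len(n), "<")
dim(m)
format(object.size(m), units = "Mb")
```

The measured slowdown is smaller than this ratio, presumably because the vectorised operations are cheap per element, but the memory used still grows quadratically with the length of the vector.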

So I would recommend using the for loop: it is readable and faster than my approach. My approach might be worth considering if you want provably correct code, as the one-liner is quite close to a specification of the problem.
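One way to use the one-liner as a specification is to check a faster implementation against it on random inputs. The sketch below is only an illustration: `count_above_loop_safe` is a hypothetical variant of the loop, not the function that was benchmarked, and it iterates with `seq_len()` because `2:length(v)` counts backwards (`2, 1`) when the vector has length 1.

```r
# Reference implementation: the outer-product one-liner from this answer
count_above_outer <- function(X) {
  indices <- seq_along(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

# Hypothetical loop variant; seq_len(length(v))[-1] is empty for length 1,
# so the loop body is skipped instead of running over 2:1
count_above_loop_safe <- function(v) {
  counts <- integer(length(v))
  for (i in seq_len(length(v))[-1]) {
    counts[i] <- sum(v[seq_len(i - 1)] < v[i])
  }
  counts
}

set.seed(1)  # arbitrary seed, only for reproducibility
for (n in c(1, 2, 10, 200)) {
  # Sampling with replacement from 1:100 makes ties likely, which
  # exercises the strict "<" comparison
  X <- sample(100, n, replace = TRUE)
  stopifnot(all(count_above_outer(X) == count_above_loop_safe(X)))
}
```

If the loop ever disagreed with the reference, `stopifnot()` would raise an error at the first failing input size.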
