Clarinetist - 2 months ago
R Question

Finding the number of values above a value and less than a value in a df column without using a loop

Consider the following:

df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))

count_above <- function(vector)
{
  counts <- vector()
  counts[1] <- 0
  for (i in 2:length(vector))
  {
    temp <- vector[1:i]
    counts <- c(counts, sum(temp < vector[i]))
  }
  return(counts)
}


This gives me the correct output:

count_above(df$X)
[1] 0 1 1 0 2


For instance, the (column) vector here is

5000
6000
5500
5000
5300


At the very top, 5000, there are no values above it. So this gives value 0.

At the 6000, there is one value which is above it and is less than 6000: the 5000. So this gives value 1.

At the 5500, there are two values above it, one of which is less than 5500, so this gives value 1, and so forth.

Is there any way I can write this out without using a loop?

Answer

Another approach, quite similar to aichao's solution but a bit shorter:

X <- c(5000, 6000, 5500, 5000, 5300)
indices <- 1:length(X)
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
## [1] 0 1 1 0 2
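
To see why the one-liner works, it can help to build the two logical matrices separately. This is only a sketch on the same X; the names smaller and earlier are introduced here for illustration and are not part of the original answer.

```r
X <- c(5000, 6000, 5500, 5000, 5300)
n <- length(X)

# smaller[i, j] is TRUE when X[i] < X[j]
smaller <- outer(X, X, "<")

# earlier[i, j] is TRUE when position i comes before position j
earlier <- outer(seq_len(n), seq_len(n), "<")

# Column j counts the elements that are both earlier than j and smaller than X[j]
count_above <- colSums(smaller & earlier)
count_above
## [1] 0 1 1 0 2
```

Each column of smaller & earlier marks exactly the values the question describes: those above the current row that are less than its value.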

Edit (Performance): Perhaps my idea was selected as the accepted answer because the code is short and self-explanatory, but be careful about using it on large vectors! It is the slowest of all the solutions suggested here. Similar to what dracodoc did, I also ran a microbenchmark, but on a randomly generated vector of 3000 values to get more meaningful run times:

count_above_loop <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}

count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}

X <- runif(3000)

microbenchmark::microbenchmark(count_above_loop(X), 
                               count_above_apply(X),
                               count_above_outer(X), times = 10)

Unit: milliseconds
                 expr       min        lq      mean    median        uq       max neval cld
  count_above_loop(X)  56.27923  58.17195  62.07571  60.08123  63.92010  77.31658    10  a 
 count_above_apply(X)  54.41776  55.07511  57.12006  57.22372  58.61982  59.95037    10  a 
 count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419    10   b

We see that the apply approach, on a large vector and without the overhead of a data frame, is slightly faster than the for-loop, although the two are close (both fall in the same cld group).

My outer product approach takes more than twice as long, since each outer() call builds a full n-by-n logical matrix (9 million entries for n = 3000).

So I would recommend using the for-loop: it is readable and faster than the outer approach. My approach might be worth considering if you want provably correct code (as this one-liner is quite close to a specification of the problem).
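
Before trusting the timings, it is worth confirming that the three functions return the same counts. A minimal sketch, reusing the function definitions from the benchmark; the seed and the vector length of 500 are arbitrary choices made here, not part of the original benchmark.

```r
# Functions as defined for the benchmark
count_above_loop <- function(v) {
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v)) {
    counts[i] <- sum(v[1:(i - 1)] < v[i])
  }
  counts
}

count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}

set.seed(1)        # arbitrary seed, only for reproducibility
X <- runif(500)    # arbitrary size, smaller than the benchmark vector

# Convert to numeric so integer vs. double storage does not matter
res_loop  <- as.numeric(count_above_loop(X))
res_outer <- as.numeric(count_above_outer(X))
res_apply <- as.numeric(count_above_apply(X))

# TRUE only if all three implementations agree
agree <- isTRUE(all.equal(res_loop, res_outer)) &&
  isTRUE(all.equal(res_loop, res_apply))
agree
```

Note that the loop and outer versions both rely on 2:length(v) or 1:length(X), so this check assumes a vector with at least two elements.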
