Clarinetist - 4 months ago 18

R Question

Consider the following:

```
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))

count_above <- function(vector)
{
  counts <- vector()
  counts[1] <- 0
  for (i in 2:length(vector))
  {
    temp <- vector[1:i]
    counts <- c(counts, sum(temp < vector[i]))
  }
  return(counts)
}
```

This gives me the correct output:

```
count_above(df$X)
## [1] 0 1 1 0 2
```

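For instance, the (column) vector here is

```
5000
6000
5500
5000
5300
```

At the very top, `5000` has no values above it, so the count is `0`.

At the second value, `6000`, one value above it is less than `6000` (namely `5000`), so the count is `1`.

At the third value, `5500`, one value above it is less than `5500`, so the count is `1`.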

Answer

Another approach, quite similar to aichao's solution (but a bit shorter):

```
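X <- c(5000, 6000, 5500, 5000, 5300)
indices <- seq_along(X)
# Entry [i, j] is TRUE when X[i] < X[j] and i < j; each column sum is the count for element j
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
count_above
## [1] 0 1 1 0 2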
```
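To see why the one-liner works, it can help to look at the intermediate matrices. The following is only an illustrative sketch; the names `value_less`, `comes_before` and `both` are made up here and are not part of the original answer.

```r
X <- c(5000, 6000, 5500, 5000, 5300)
idx <- seq_along(X)

# value_less[i, j] is TRUE when X[i] < X[j]
value_less <- outer(X, X, "<")

# comes_before[i, j] is TRUE when position i precedes position j
comes_before <- outer(idx, idx, "<")

# Both conditions at once: an earlier element that is also smaller
both <- value_less & comes_before

# Column j marks the matches for element j, so its sum is the count for j
counts <- colSums(both)
print(counts)
```

For example, `both[1, 5]` and `both[4, 5]` are the only `TRUE` entries in column 5, which is where the final count of 2 comes from.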

**Edit (Performance):** Perhaps my idea was selected as the accepted answer because the code is short and self-explanatory, but be careful about using it on large vectors: it is the slowest of all the solutions suggested here. Like dracodoc, I ran a microbenchmark, but I used a randomly generated vector of 3000 values to get more meaningful run times:

```
# For-loop: compare each element with all earlier ones
count_above_loop <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}

# Outer product: build the full comparison matrices, then sum the columns
count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

# sapply: same comparison as the loop, written as a function per index
count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}

X <- runif(3000)
microbenchmark::microbenchmark(count_above_loop(X),
                               count_above_apply(X),
                               count_above_outer(X), times = 10)
## Unit: milliseconds
##                  expr       min        lq      mean    median        uq       max neval cld
##   count_above_loop(X)  56.27923  58.17195  62.07571  60.08123  63.92010  77.31658    10  a
##  count_above_apply(X)  54.41776  55.07511  57.12006  57.22372  58.61982  59.95037    10  a
##  count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419    10   b
```

We see that on a large vector, and without the overhead of a data frame, the apply approach is slightly faster than the for-loop (a median of about 57 ms versus 60 ms, with both in the same `cld` group).

My outer product approach takes more than double the time.

So I would recommend using the for-loop: it is readable and considerably faster than my approach. My approach might still be worth considering if you want provably correct code, as the one-liner is quite close to a specification of the problem.
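One way to combine both points is to treat the outer-product one-liner as a reference and check the faster loop against it on small random inputs. The sketch below is only an illustration: the seed, the number of trials and the input sizes are arbitrary choices, and the loop bounds use `seq_along(v)[-1]` (a small change from the benchmark version) so that a length-1 vector is also handled.

```r
# Reference implementation: the one-liner, treated as the specification
count_above_outer <- function(X) {
  indices <- seq_along(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

# Faster candidate: the for-loop (bounds adjusted, see note above)
count_above_loop <- function(v) {
  counts <- integer(length(v))
  for (i in seq_along(v)[-1]) {
    counts[i] <- sum(v[1:(i - 1)] < v[i])
  }
  counts
}

set.seed(1)  # arbitrary seed, only for reproducibility
for (trial in 1:20) {
  # Small integer inputs with repeated values, so ties are exercised too
  X <- sample(1:50, size = 30, replace = TRUE)
  stopifnot(all(count_above_loop(X) == count_above_outer(X)))
}
```

If `stopifnot()` never fails, the loop agrees with the reference on every sampled input. That is evidence rather than proof, but it is cheap to run.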