JackStat - 1 year ago 121

R Question

I have a custom rank function that I stole from here (with some modifications) :)

Rcpp sugar for rank function

The problem is that it does min ties and I need average ties

Here is what I have

`#include <Rcpp.h>`

using namespace Rcpp;

// [[Rcpp::export]]

NumericVector sort_rcpp(NumericVector x) {

std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);

std::sort(tmp.begin(), tmp.end());

return wrap(tmp);

}

// [[Rcpp::export]]

IntegerVector rank_(NumericVector x) {

return match(x, sort_rcpp(x));

}

/*** R

x <- c(1:5, 1:5)

rank(x, ties = 'average')

rank(x, ties = 'min')

rank_(x)

*/

After saving that to a file then running this gets the results

`Rcpp::sourceCpp('~/Documents/Rank.cpp')`

Which returns

`# x <- c(1:5, 1:5)`

#

# # what I need

# rank(x, ties = 'average')

# [1] 1.5 3.5 5.5 7.5 9.5 1.5 3.5 5.5 7.5 9.5

#

# # What I am getting

# rank(x, ties = 'min')

# [1] 1 3 5 7 9 1 3 5 7 9

#

# rank_(x)

# [1] 1 3 5 7 9 1 3 5 7 9

What do I need to modify in the c++ code to match the average rank function from base R?

Answer Source

This is an adapted version of RenĂ© Richter's code in shayaa's link -- the main differences being the use of `Rcpp::seq`

instead of `std::iota`

and a custom comparator that handles `NA`

comparisons:

```
#include <Rcpp.h>
class Comparator {
private:
const Rcpp::NumericVector& ref;
bool is_na(double x) const
{
return Rcpp::traits::is_na<REALSXP>(x);
}
public:
Comparator(const Rcpp::NumericVector& ref_)
: ref(ref_)
{}
bool operator()(const int ilhs, const int irhs) const
{
double lhs = ref[ilhs], rhs = ref[irhs];
if (is_na(lhs)) return false;
if (is_na(rhs)) return true;
return lhs < rhs;
}
};
// [[Rcpp::export]]
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
R_xlen_t sz = x.size();
Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
std::sort(w.begin(), w.end(), Comparator(x));
Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
for (R_xlen_t n, i = 0; i < sz; i += n) {
n = 1;
while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
for (R_xlen_t k = 0; k < n; k++) {
r[w[i + k]] = i + (n + 1) / 2.;
}
}
return r;
}
```

Verifying the results against `base::rank`

,

```
x <- c(1:7, 1:2, 1:5, 1:10)
all.equal(
rank(x, ties.method = "average"),
avg_rank(x)
)
# [1] TRUE
```

Also note that this handles `NA`

s correctly while your version does not:

```
all.equal(
rank(c(NA, x), ties.method = "average"),
avg_rank(c(NA, x))
)
# [1] TRUE
all.equal(
rank(c(NA, x), ties.method = "average"),
rank_(c(NA, x))
)
# Error: can't subset using a logical vector with NAs
```

Here is a benchmark with the vector `x`

from above:

```
microbenchmark::microbenchmark(
".Internal" = .Internal(rank(x, length(x), ties = "average")),
avg_rank(x),
"base::rank" = rank(x, ties.method = "average"),
rank_(x),
times = 1000L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# .Internal 1.283 1.711 2.029777 1.712 2.139 65.002 1000
# avg_rank(x) 2.566 3.422 4.057262 3.849 4.277 23.521 1000
# base::rank 13.685 16.251 18.145440 17.534 18.390 163.360 1000
# rank_(x) 25.659 28.653 31.203092 29.936 32.074 112.898 1000
```

And here is a benchmark with a 1e6-length vector (I didn't include `rank_`

because even a single evaluation was taking forever; see below):

```
set.seed(123)
xx <- sample(x, 1e6, TRUE)
microbenchmark::microbenchmark(
".Internal" = .Internal(rank(xx, length(xx), ties = "average")),
avg_rank(xx),
"base::rank" = rank(xx, ties.method = "average"),
times = 100L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# .Internal 302.2488 309.7977 334.7977 322.0396 347.4779 504.1041 100
# avg_rank(xx) 304.4435 309.9840 330.4902 316.7807 331.6825 427.7171 100
# base::rank 312.1196 327.3319 351.6237 343.1317 366.7316 445.9004 100
```

For larger sized vectors, the runtime of these three functions is much closer. In theory, the `.Internal`

call should always be a little faster than `base::rank`

since it foregoes additional checks that take place in the body of the latter. However, the difference is less pronounced in the second benchmark as the amount of time needed to do those checks represents a much smaller proportion of the function's total runtime.

As a side note, one of the obvious reasons your code is so inefficient is because you are creating vectors in the body of your loop:

```
for (int i = 0; i < n; ++i) {
NumericVector xVal = x[x == x[i]]; // here
IntegerVector Match = match(xVal, sortX); // here
double minM = min(Match);
int matchSize = Match.size();
NumericVector aves = NumericVector(matchSize); // here
for (int k = 0; k < matchSize; ++k) {
aves[k] = minM + (k);
}
ranks[i] = sum(aves)/Match.size();
}
```

Both `avg_rank`

and R's implementation (I believe, you can double check the source code) are only creating two additional vectors, regardless of the size of the input. Your function is creating **2 + 3 * N vectors** (!!!), where N is the size of your input.

And finally, not really related to performance, but instead of sorting like this (which will **not** handle `NA`

s correctly),

```
NumericVector sort_rcpp(NumericVector x) {
std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
std::sort(tmp.begin(), tmp.end());
return wrap(tmp);
}
```

just use the tools Rcpp provides:

```
NumericVector RcppSort(NumericVector x) {
return Rcpp::clone(x).sort();
}
```