Carl Carl - 1 year ago 45
C++ Question

How does arma::find_unique determine unique indices?

I am using

, and I thought it returned the index of the first occurrence of each unique value in a vector, but it appears to return something else.

Here is a toy function:

// [[Rcpp::export]]
arma::uvec test(arma::vec& x_) {
vec x=arma::sort(x_);
return arma::find_unique(x);

If I run the function in R with a simple vector
I get a vector of all the indices
which makes sense since each value is unique.

If I try something like:




All those values make sense except the first one. Why is the first unique value at index 1 and not 0? Clearly I am misunderstanding what
intends to do so I would appreciate if someone could enlighten me.

My session information


Answer Source

Okay, the following is courtesy of @nrussell, the man is amazing, and was given in the comments to this "answer." (I do not deserve the check mark nor upvotes.)

Actually, I'm pretty sure this is all just a misinterpretation of the Armadillo documentation, which never actually guarantees that a stable sort is used, as @Carl was expecting. Underneath, std::sort is being called, which is not guaranteed to be a stable sort by the C++ standard; also stated here:

"The order of equal elements is not guaranteed to be preserved."

I can demonstrate this here, replicating the "packet" structure use in the Armadillo's algorithm. My guess is that libc++ (typically used by OS X) does implement std::sort as a stable sort, while libstdc++ does not.

My turn: The stable sort, or maintaining the relative order of records with equal keys (i.e. values), is the key issue behind this question. For example, consider the following:

dog car pool dig

Sorting by the first letter with a stable sort gives us:

car dog dig pool

Because the word "dog" appeared prior to "dig" in the vector, it therefore must appear before "dig" in the output.

Sorting by the first letter with a unstable sort gives us:

car dig dog pool


car dog dig pool

The principal is relevant to numbers since each key generate is literally present elsewhere. So, we have:

2, 3, 2, 4

Thus, when the unique values are found:

2, 3, 4

The 2 can take id either 0 or 2.

As @nrussell explained, macOS since OS X Mavericks (10.9) relies by default on --stdlib=libc++ vs. the traditional --stdlib=libstdc++ flag for compiling. This was likely the reason why I was unable to replicate it as one implementation opts for stability while the other does not.

Original Answer

First, I'm not able to replicate this on macOS... (See end)

It seems as if we are able to repro this on Linux though (@nrussel). Which means at some point, there is an issue given in the linked code.

Secondly, arma::find_unique is implemented here using matrix ops with op_find_unique. The later is the key as it implements the comparators.

Thus, in short, there should be no way that is possible given that you sort the vector and the first item is always considered to be unique.

Test function

#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::uvec test(arma::vec& x_) {
    Rcpp::Rcout << "Input:" << x_.t() << std::endl;
    arma::vec x = arma::sort(x_);
    Rcpp::Rcout << "Sorted:" << x.t() << std::endl;
    arma::uvec o = arma::find_unique(x);
    Rcpp::Rcout << "Indices:" << o.t() << std::endl;
    return o;

/*** R
## [1] 2 2 1 5 7 6 7 6 4 1 5 3 1 4 4 2 8 7 7 8
## [1] 1 1 1 2 2 2 3 4 4 4 5 5 6 6 7 7 7 7 8 8

### Received
##    2.0000   2.0000   1.0000   5.0000   7.0000   6.0000   7.0000   6.0000   4.0000   1.0000   5.0000   3.0000   1.0000   4.0000   4.0000   2.0000   8.0000   7.0000   7.0000   8.0000

### Sorted
## 1.0000   1.0000   1.0000   2.0000   2.0000   2.0000   3.0000   4.0000   4.0000   4.0000   5.0000   5.0000   6.0000   6.0000   7.0000   7.0000   7.0000   7.0000   8.0000   8.0000

### Output
## 0    3    6    7   10   12   14   18