Brandon Bertelsen Brandon Bertelsen - 3 months ago 7
R Question

Review cpp code causing segfault

I have some cpp code that's running within an R function that's called about ~80k times. It's test suite is comprehensive and passing. It seems to be running fine for the first 60k times it's called and then somewhere in the middle, I get a segfault.


*** Error in `/usr/lib/R/bin/exec/R': malloc(): memory corruption: 0x00000000047150f0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x77725)[0x7f684462e725]
/lib/x86_64-linux-gnu/libc.so.6(+0x819be)[0x7f68446389be]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f684463a5a4]
/usr/lib/R/lib/libR.so(Rf_allocVector3+0x70d)[0x7f6844cd617d]
... # more


Here's some of my code as an example, can you see anything wrong with it?

Return a TRUE FALSE vector where leading NAs are marked as TRUE

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
LogicalVector leading_na(IntegerVector x) {
int n = x.size();
LogicalVector leading_na(n);

int i = 0;
while(x[i] == NA_INTEGER) {
leading_na[i] = TRUE;
i++;
}
return leading_na;
}


Return a TRUE FALSE vector where trailing NAs are marked as TRUE

// [[Rcpp::export]]
LogicalVector trailing_na(IntegerVector x) {
LogicalVector trailing_na = leading_na(rev(x));
return rev(trailing_na);
}


Copies the functionality of
na.locf(x, na.rm = TRUE)
from the zoo package:

// [[Rcpp::export]]
IntegerVector na_locf(IntegerVector x) {
int n = x.size();
LogicalVector lna = leading_na(x);

for(int i = 0; i<n; i++) {
if((i > 0) & (x[i] == NA_INTEGER) & (lna[i] != TRUE)) {
x[i] = x[i-1];
}
}
return x;
}


Return the last position in a vector where there was a number:

// [[Rcpp::export]]
int max_x_pos(IntegerVector x) {
IntegerVector y = rev(x);
int n = x.size();
LogicalVector leading_na(n);

int i = 0;
while(y[i] == NA_INTEGER) {
i++;
}

return n-i;
}

Answer

To address the primary issue, you are getting seemingly random segfaults because your code contains undefined behavior -- specifically, an array boundary violation. Since you noted earlier that you are fairly new to C++, it would be worthwhile for you to read through at least the first answer to this question which discusses this particular mistake. UB can be a slippery concept to wrap your head around for people coming to C or C++ from other languages, primarily because it does not always manifest itself in the form of an error. The behavior is literally undefined, so there is no way to know what the result will be, nor should you expect the behavior to be consistent across platforms, compilers, or even compiler versions.

I will use your leading_na function to demonstrate, but the max_x_pos function has the same problem:

// [[Rcpp::export]]
Rcpp::LogicalVector leading_na(Rcpp::IntegerVector x) {
    int n = x.size();
    Rcpp::LogicalVector leading_na(n);

    int i = 0;
    while (x[i] == NA_INTEGER) {
        // ^^^^  
        Rcpp::Rcout << i << "\n";

        leading_na[i] = TRUE;
        i++;
    }

    return leading_na;
} 

Since there isn't anything enforcing the constraint i < n, when x contains only NA elements, the code proceeds to evaluate x[n] (and possibly subsequent indices), which is undefined. However, this runs just fine on my machine for smaller vectors (I finally managed to make it crash with larger input), which is precisely why UB-related errors can be tough to identify:

leading_na(rep(NA, 5))
# 0
# 1
# 2
# 3
# 4
# [1] TRUE TRUE TRUE TRUE TRUE 

However, when we replace operator[] with the at() member function, which performs the same element access, but also does bounds checking at runtime, the error is apparent:

// [[Rcpp::export]]
Rcpp::LogicalVector leading_na2(Rcpp::IntegerVector x) {
    int n = x.size();
    Rcpp::LogicalVector leading_na(n);

    int i = 0;
    while (x.at(i) == NA_INTEGER) {
        Rcpp::Rcout << i << "\n";

        leading_na[i] = TRUE;
        i++;
    }

    return leading_na;
}

and then

leading_na2(rep(NA, 5))
# 0
# 1
# 2
# 3
# 4
# Error: index out of bounds 

Note that the additional bounds checking provided by at does come at a slight performance cost, since this check happens at runtime, so even though it can be a good idea to use at instead of operator[] during the development stages, once your code has been thoroughly tested it is probably a good idea to revert to operator[], assuming the better performance is desired.


As for solutions, the first would be to keep the while loop and simply add a check on the value of i:

while (i < n && x[i] == NA_INTEGER) {
    leading_na[i] = TRUE;
    i++;
} 

Notice that I wrote i < n && x[i] == NA_INTEGER and not x[i] == NA_INTEGER && i < n. Since && performs short-circuit evaluation, when i < n evaluates as false in the first version, the expression x[i] == NA_INTEGER is not evaluated -- which is good, because as we have seen, that is undefined behavior.

Another option would be to use a for loop instead, which tend to do a better job of "reminding" us to check our boundaries, in my experience, at least:

for (int i = 0; i < n && x[i] == NA_INTEGER; i++) {
    leading_na[i] = TRUE;
}

The choice to use a while loop or a for loop does not really matter in this case, provided that whatever you choose is written correctly.

Yet another option (or two) is to work with iterators rather than indices, in which case you could use either a while loop or a for loop:

// [[Rcpp::export]]
Rcpp::LogicalVector leading_na5(Rcpp::IntegerVector x) {
    int n = x.size();
    Rcpp::LogicalVector leading_na(n);

    Rcpp::IntegerVector::const_iterator it_x = x.begin();
    Rcpp::LogicalVector::iterator first = leading_na.begin(),
        last = leading_na.end();

/*
    while (first != last && *it_x++ == NA_INTEGER) {
        *first++ = TRUE;
    }
*/

    for ( ; first != last && *it_x == NA_INTEGER; ++first, ++it_x) {
        *first = TRUE;
    }

    return leading_na;
} 

Although iterators are very useful devices, I'm not sure they provide any benefit over manual indexing in this particular case, so I would recommend using one of the first two approaches.


Unrelated to the segfault, there are a few other aspects of your code worth addressing.

  1. In R, && and || perform atomic logical AND and atomic logical OR, respectively, while & and | perform vectorized logical AND and vectorized logical OR, respectively. In C++, && and || behave as they do in R, but & and | are (atomic) bitwise AND and (atomic) bitwise OR, respectively. Just by chance, using & had the same effect as using && in your function above, but you will want to fix that since your intention was to use the logical operation, rather than the bitwise counterpart.
  2. This is more specific to Rcpp / R's C API, but although using x[i] == NA_INTEGER does, in fact, test if x[i] is NA, not all types behave like this. IIRC, testing anything for equality against NA_REAL is always false, even NA_REAL == NA_REAL; and for non-integer arithmetic types (numeric and complex (REALSXP / CPLXSXP)), you most likely also want to be checking if the value is NaN. Rcpp provides a few different ways to do this depending on the object type. For vectors of any storage type, Rcpp::is_na(x) will return a logical vector the same size as x. For atomic values, I typically use Rcpp::traits::is_na<SEXPTYPE>(x[i]) -- REALSXP for double, INTSXP for int, CPLXSXP for Rcomplex, and so on. However, I think you can equivalently use the vector's corresponding static member function, e.g. Rcpp::NumericVector::is_na(x[i]), etc., in which case you don't need to memorize the various SEXPTYPEs.
  3. Strictly speaking there is no TRUE or FALSE in C++ or C; these are (presumably) convenience typedefs provided by R's API, so just be aware that they do not exist outside of R's backend. Of course, feel free to use them in your Rcpp code, as they clearly behave as intended, but most people stick to the standard true and false even when working with Rcpp.
  4. Kind of a nitpick, but your leading_na function declares a local variable also named leading_na, which is slightly confusing, or at least unorthodox.
  5. Consider using std::size_t (standard C++) or R_xlen_t (R API specific) when working with sizes of objects, such as in this expression: int n = x.size();. Those are unsigned integer types which should be large enough to store the length of any object, where as int is a signed integer type which may or may not be sufficient (it usually is). 99.9% of the time the worst that will happen is you will get some additional compiler warnings (not errors) when using ints rather than the other two in expressions like for (int i = 0; i < x.size(); i++) { // whatever }. On rare occasion there may be worse repercussions, such as signed integer overflow (which is also undefined behavior), so just be aware of this remote possibility.

This answer kind of turned into a code review / soapbox rant, but hopefully you find some useful information in there.

Comments