UnnamedUser - 15 days ago
R Question

Rcpp subsetting rows of DataFrame

I want to build the following subset of the iris dataset in Rcpp:

head(subset(iris, Species == "versicolor"))

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51          7.0         3.2          4.7         1.4 versicolor
52          6.4         3.2          4.5         1.5 versicolor
53          6.9         3.1          4.9         1.5 versicolor
54          5.5         2.3          4.0         1.3 versicolor
55          6.5         2.8          4.6         1.5 versicolor
56          5.7         2.8          4.5         1.3 versicolor


I know how to subset the columns of an Rcpp::DataFrame: there is an overloaded operator [ that works as in R, e.g. x["var"]. However, I cannot find a way to subset the rows of a DataFrame whose number of columns is not fixed.

I would like to write a function subset_rows_rcpp_iris that takes an Rcpp::DataFrame (which will always be iris) and a CharacterVector level_of_species as inputs and returns a DataFrame object.

DataFrame subset_rows_rcpp_iris(DataFrame x, CharacterVector level_of_species) {
...
}


First, I want to find the indices of the rows that satisfy a logical condition. My problem is that if I access the Species vector in the test function below, save it as a CharacterVector, and then compare it with level_of_species, I always get only one TRUE value for setosa and FALSE values in all other cases.

cppFunction('
LogicalVector test(DataFrame x, CharacterVector level_of_species) {
  CharacterVector sub = x["Species"];
  LogicalVector ind = sub == level_of_species;
  return(ind);
}
')
head(test(iris, "setosa"))

[1] TRUE FALSE FALSE FALSE FALSE FALSE


If this worked, I could rewrite the test function to use the vector of TRUE/FALSE values to subset each column of the data frame separately and then combine the columns again with Rcpp::DataFrame::create.

Answer
As far as I know, the == in the question is applied element by element and Rcpp does not recycle the length-one level_of_species the way R does, so only the first element is compared meaningfully. Comparing each element of Species against a single string in a loop avoids this:

cppFunction('LogicalVector test(DataFrame x, StringVector level_of_species) {
  using namespace std;
  StringVector sub = x["Species"];
  // only the first element of level_of_species is used
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  // compare each element of Species with that single string
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

  return(ind);
}')

> xx = test(iris, "setosa")
> table(xx)
xx
FALSE  TRUE 
  100    50 
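
The loop above boils down to comparing every element against one string. As a minimal sketch of that step outside Rcpp, assuming a plain std::vector<std::string> stands in for StringVector (the helper name equals_mask is hypothetical and not part of Rcpp):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not an Rcpp function.
// Returns one flag per element of `values`: true where the element equals `level`.
// std::vector<std::string> is an assumed stand-in for Rcpp::StringVector.
std::vector<bool> equals_mask(const std::vector<std::string>& values,
                              const std::string& level) {
    std::vector<bool> mask(values.size(), false);
    for (std::size_t i = 0; i < values.size(); ++i) {
        mask[i] = (values[i] == level);
    }
    return mask;
}
```

The mask always has the same length as the input, which is what makes it usable as a row filter afterwards.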

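Subsetting done! (I learned a lot from this question myself, thanks!) Note that the version below returns only the four numeric columns of the matching rows; the Species column is not carried over.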

cppFunction('Rcpp::DataFrame test(DataFrame x, StringVector level_of_species) {
  using namespace std;
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

  // extracting each column into a vector
  Rcpp::NumericVector SepalLength = x["Sepal.Length"];
  Rcpp::NumericVector SepalWidth = x["Sepal.Width"];
  Rcpp::NumericVector PetalLength = x["Petal.Length"];
  Rcpp::NumericVector PetalWidth = x["Petal.Width"];

  // keep only the rows flagged in ind and rebuild the data frame
  return Rcpp::DataFrame::create(Rcpp::Named("Sepal.Length") = SepalLength[ind],
                                 Rcpp::Named("Sepal.Width") = SepalWidth[ind],
                                 Rcpp::Named("Petal.Length") = PetalLength[ind],
                                 Rcpp::Named("Petal.Width") = PetalWidth[ind]
  );
}')

> yy = test(iris, "setosa")
> str(yy)
 'data.frame':  50 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
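
The function above names every column explicitly, while the question asks about a number of columns that is not fixed. Here is a minimal sketch of the same mask-then-subset idea for any number of numeric columns, written in plain C++ so that it compiles without Rcpp. The Table alias and the subset_rows helper are hypothetical stand-ins for Rcpp::DataFrame and are not part of Rcpp.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a data frame: column name -> numeric column.
// Note: std::map orders columns by name, not by their original position.
using Table = std::map<std::string, std::vector<double>>;

// Hypothetical helper: keep only the rows where keep[i] is true,
// applying the same mask to every column in the table.
Table subset_rows(const Table& table, const std::vector<bool>& keep) {
    Table result;
    for (const auto& entry : table) {
        const std::vector<double>& column = entry.second;
        assert(column.size() == keep.size());  // mask must match the row count
        std::vector<double> kept;
        for (std::size_t i = 0; i < column.size(); ++i) {
            if (keep[i]) {
                kept.push_back(column[i]);
            }
        }
        result[entry.first] = kept;
    }
    return result;
}
```

In Rcpp the same loop would presumably run over the columns of the DataFrame, with each column's type handled separately; that part is left out of this sketch.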