dimitris_ps - 2 years ago 152
R Question

# Vectorized pattern matching returning the pattern in R

My problem is mostly that of efficiency.

I have a vector of patterns that i would like to match against a vector

`x`
.

The end result should return the pattern that is match to each element of the vector. A second criteria would be, if many patterns are matched for a specific element of the vector
`x`
, then return the first pattern matched.

For example, lets say the vector of patterns is:

``````patterns <- c("[0-9]{2}[a-zA-Z]", "[0-9][a-zA-Z] ", " [a-zA-Z]{3} ")
``````

and the vector
`x`
is:

``````x <- c("abc 123ab abc", "abc 123 abc ", "a", "12a ", "1a ")
``````

The end result would be:

``````customeRExp(patterns, x)
[1] "[0-9]{2}[a-zA-Z]" " [a-zA-Z]{3} "
[3]  NA                "[0-9]{2}[a-zA-Z]"
[5] "[0-9][a-zA-Z] "
``````

This is what i have so far:

``````customeRExp <- function(pattern, x){
m <- matrix(NA, ncol=length(x), nrow=length(pattern))
for(i in 1:length(pattern)){
m[i, ] <- grepl(pattern[i], x)}
indx <- suppressWarnings(apply(m, 2, function(y) min(which(y, TRUE))))
pattern[indx]
}

customeRExp(patterns, x)
``````

Which correctly returns:

``````[1] "[0-9]{2}[a-zA-Z]" " [a-zA-Z]{3} "    NA
[4] "[0-9]{2}[a-zA-Z]" "[0-9][a-zA-Z] "
``````

The problem is that my dataset is huge, and the list of patterns quite big also.

Is there a more efficient way of doing the same?

My default approach to speeding up loops like the above is generally to just rewrite in C++. Here's a quick attempt using Boost Xpressive:

``````// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/xpressive/xpressive.hpp>

namespace xp = boost::xpressive;

// [[Rcpp::export]]
Rcpp::CharacterVector
first_match(Rcpp::CharacterVector x, Rcpp::CharacterVector re) {
R_xlen_t nx = x.size(), nre = re.size(), i = 0, j = 0;
Rcpp::CharacterVector result(nx, NA_STRING);
std::vector<xp::sregex> vre(nre);

for ( ; j < nre; j++) {
vre[j] = xp::sregex::compile(std::string(re[j]));
}

for ( ; i < nx; i++) {
for (j = 0; j < nre; j++) {
if (xp::regex_search(std::string(x[i]), vre[j])) {
result[i] = re[j];
break;
}
}
}

return result;
}
``````

The point of this approach is to save unnecessary calculations by `break`ing as soon as we find a matching regular expression.

The performance increase isn't earth-shattering (~40%), but it is an improvement over your current function. Here is a test using larger versions of your sample data:

``````x2 <- rep(x, 5000)
p2 <- rep(patterns, 100)

all.equal(first_match(x2, p2), customeRExp(p2, x2))
#[1] TRUE

microbenchmark::microbenchmark(
first_match(x2, p2),
customeRExp(p2, x2),
times = 50
)
# Unit: seconds
#                 expr      min       lq     mean   median       uq      max neval
#  first_match(x2, p2) 1.743407 1.780649 1.900954 1.836840 1.931783 2.544041    50
#  customeRExp(p2, x2) 2.368621 2.459748 2.681101 2.566717 2.824887 3.553025    50
``````

Another option would be to look into using the `stringi` package which generally outperforms base R by a good margin.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download