jul635 jul635 - 28 days ago 9
R Question

R: efficiently grep characters in rows of large data.frame

I have a data frame of character strings that is >1M rows long:

>head(df)
A B C D
1 S1 S2 U1 U2
2 S1 S2 S2 S1
3 S2 S1 S1 S2
4 S1 M2 U1 S2
5 S1 S1 M2 M1
6 M2 M2 M1 M2


I would like to identify all rows where a particular character is present (e.g., "U").
The solutions I have found so far are working, but they are very slow, for example:

matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })


Any idea how to improve this query?
Thanks!

Answer

EDIT: updates to address comments:

The following is also very fast (0.31 seconds, even faster than before):

rows <- which(
  rowSums(
    `dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
  ) > 0
)

And produces the same result as previous answers. Using fixed=FALSE about doubles the time, but your example doesn't require that.

What we're doing here is cheating by applying grepl to a matrix, though really what we care about is turning df into a vector (which a matrix is), and as.matrix is one of the faster ways to do this. Then we can just run one grepl command. Finally, we use dim<- to turn the grepl vector result back into a matrix, and use rowSums to check which rows had matches.

Here are the reasons why this is much faster than your version:

  • We call grepl once, instead of a million times as you do with apply since the function apply applies gets called once for each row; grepl is vectorized which means you want to minimize how many times you call it and take advantage of the vectorization
  • We do the row match counts with rowSums instead of apply; rowSums is a much faster version of apply(x, 1, sum) (see docs for ?rowSums).

PREVIOUS ANSWER:

Here is a relatively straightforward solution that runs in 0.35 seconds on my system for a 1MM row by 4 column data frame:

rows <- which(rowSums(as.matrix(df) == "U") > 0)

To confirm

df[head(rows), ]

produces (every row has a U):

   a b c d
5  F B D U
8  R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H

And the data:

set.seed(1)
df <- as.data.frame(
  `names<-`(
    replicate(4, sample(LETTERS, 1e6, rep=T), simplify=F),
    letters[1:4]
  )
)