jul635 - 28 days ago 9

R Question

I have a data frame of character strings that is >1M rows long:

`>head(df)`

A B C D

1 S1 S2 U1 U2

2 S1 S2 S2 S1

3 S2 S1 S1 S2

4 S1 M2 U1 S2

5 S1 S1 M2 M1

6 M2 M2 M1 M2

I would like to identify all rows where a particular character is present (e.g., "U").

The solutions I have found so far are working, but they are very slow, for example:

`matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })`

Any idea how to improve this query?

Thanks!

Answer

EDIT: updates to address comments:

The following is also very fast (0.31 seconds, even faster than before):

```
rows <- which(
rowSums(
`dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
) > 0
)
```

And produces the same result as previous answers. Using `fixed=FALSE`

about doubles the time, but your example doesn't require that.

What we're doing here is cheating by applying `grepl`

to a matrix, though really what we care about is turning `df`

into a vector (which a matrix is), and `as.matrix`

is one of the faster ways to do this. Then we can just run one `grepl`

command. Finally, we use `dim<-`

to turn the `grepl`

vector result back into a matrix, and use `rowSums`

to check which rows had matches.

Here are the reasons why this is much faster than your version:

- We call
`grepl`

once, instead of a million times as you do with`apply`

since the function`apply`

applies gets called once for each row;`grepl`

is vectorized which means you want to minimize how many times you call it and take advantage of the vectorization - We do the row match counts with
`rowSums`

instead of`apply`

;`rowSums`

is a much faster version of`apply(x, 1, sum)`

(see docs for`?rowSums`

).

PREVIOUS ANSWER:

Here is a relatively straightforward solution that runs in 0.35 seconds on my system for a 1MM row by 4 column data frame:

```
rows <- which(rowSums(as.matrix(df) == "U") > 0)
```

To confirm

```
df[head(rows), ]
```

produces (every row has a U):

```
a b c d
5 F B D U
8 R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H
```

And the data:

```
set.seed(1)
df <- as.data.frame(
`names<-`(
replicate(4, sample(LETTERS, 1e6, rep=T), simplify=F),
letters[1:4]
)
)
```