E J E J - 3 months ago 109
R Question

Automated grep() across multiple columns in large dataset in R

EDIT Reproducible example at the bottom...

I am working with a large dataset (pooled NHAMCS from the CDC):

> dim(ed0509)
[1] 174020 514


I'm having trouble using
grep()
to identify rows in a data frame based on patterns in multiple column variables
DIAG1
DIAG2
DIAG3
based on a vector list of interest
SSTI.list
. The condition is that if this pattern is identified in either one of column variables, then I want to pull that row number out to ultimately use this to subset the data to create a new categorical column
SSTI.cat
in the dataset (0 or 1).

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684","685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", "7714", "7715", "7854", "9583", "99662", "99762", "9985")


Since I am dealing with a pretty long list >1000s of elements, I'm trying to automate this process using a for loop. The desired output is having new variables that contain the list of rows for each value in the vector
SSTI.list
. I have mainly having issues running a for loop within
grep()
and I get the error:

argument 'pattern' has length > 1 and only the first element will be used


What I have tried to do so far is:

diags <- c(ed0509$DIAG1,ed0509$DIAG2,ed0509$DIAG3)


for (i in SSTI.list){
assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}


SSTI.comb
would be the final list of rows (all of
var
i) that identified the patterns in
SSTI.list
from the for loop that would be used to create the categorical variable
SSTI.cat


Then used the
data.table
package to create the categorical variable.

SSTI.comb<-sort(as.numeric(SSTI.comb))


setDT(ed0509)[SSTI.comb,SSTI.cat:=1][,SSTI.cat:=0]


EDIT for reproducibility, sorry about that...

DIAG1=c("00000","4659-","0356-","5664-","771--","7715-","78791")
DIAG2=c("3829-","00000","00000","4659-","7854-","00000","566--")
DIAG3=c("9985-","00000","00000","00000","00000","00000","00000")
df<-data.frame(DIAG1,DIAG2,DIAG3)`

SSTI.list <- c("035","9985","7854","771","7715")

for (i in SSTI.list){
assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}


Conceptually I would like to have an output where the new column variable attached to
df
would indicate that the 1st, 3rd, 5th and 6th rows are identified to satisfy the pattern indicated in
SSTI.list


DIAG1 DIAG2 DIAG3 SSTI.cat
1 00000 3829- 9985- 1
2 4659- 00000 00000 0
3 0356- 00000 00000 1
4 5664- 4659- 00000 0
5 771-- 7854- 00000 1
6 7715- 00000 00000 1
7 78791 566-- 00000 0

Answer

Here's an example with fake data that I cooked up before you added your data. Let me know if this is what you had in mind:

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684",
               "685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", 
               "7714", "7715", "7854", "9583", "99662", "99762", "9985")

# Fake data
set.seed(10)
dat = as.data.frame(replicate(5, sample(c(SSTI.list, 1e5:(1e5+1000)),10)), stringsAsFactors=FALSE)
       V1     V2     V3     V4     V5
1  100493 100642 100861 100522 100254
2  100286 100555 100604 100066 100206
3  100409 100087 100767 100145   7048
4  100682 100583 100336 100895 100719
5  100058 100338 100387 100404 100227
6  100202 100410 100695 100737 100136
7  100252 100024 100829 100813   7078
8  100249 100241 100216 100947 100468
9  100600 100378 100758 100671 100076
10 100998 100824 100334 100482 100789
# Match any instance of a pattern within any element of the data
dat[apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))),]
      V1     V2     V3     V4     V5
3 100409 100087 100767 100145   7048
4 100682 100583 100336 100895 100719  # "100682 matches "682" in SSTI.list
7 100252 100024 100829 100813   7078
# Match only if a data element is exactly the same as one of the patterns.
dat[apply(dat, 1, function(i) any(grepl(paste(paste0("^",SSTI.list,"$"), collapse="|"), i))),]
      V1     V2     V3     V4   V5
3 100409 100087 100767 100145 7048
7 100252 100024 100829 100813 7078

If you just want the row indices of matching rows:

which(apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))))

[1] 3 4 7