jrzelling - 4 months ago
R Question

Use grepl() to match multiple patterns in R

This command works to subset the data to remove all "jpg" files.

filetype.isnotjpg <- setdiff(filelist, subset(filelist, grepl("\\.jpg$", filelist)))

This takes the character vector "filelist", which contains the names of files from a directory. I want to return all files that are not of type "jpg", "doc", "pdf", "xls", etc., and I want to be able to specify as many types as I like to filter out of the list.

Ideally something like

target.files <- setdiff(filelist, subset(filelist, grepl(
c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), filelist)))

This step-by-step approach, filtering one pattern at a time, does what I want:

a <- setdiff(files.list, subset(files.list, grepl("\\.tmp", files.list, ignore.case = TRUE)))

a <- setdiff(a, subset(a, grepl("\\.jpg", a, ignore.case = TRUE)))
a <- setdiff(a, subset(a, grepl("\\.pdf", a, ignore.case = TRUE)))
a <- setdiff(a, subset(a, grepl("\\.tif", a, ignore.case = TRUE)))

etc. Something like apply() might work? I'm new to R sorry.

The solution of 42 works:

target.files <- setdiff(filelist, subset(filelist, grepl(paste(
c("\\.jpg", "\\.doc", "\\.pdf",
"\\.xls", "\\.tif", "\\.docx", "\\.xlsx", "\\.jpeg"),
collapse="|"),
filelist, ignore.case = TRUE)))

42-

I would try paste()-ing with a collapsing separator of "|" which is the OR operator for regex:

target.files <- setdiff(filelist, subset(filelist, grepl(paste(
c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse="|"), filelist)))
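
A minimal sketch of the same idea with logical indexing instead of setdiff()/subset(). The sample file names and the `exts` vector are made up for illustration; only `filelist` and `target.files` come from the question.

```r
# Hypothetical sample data standing in for the asker's `filelist`.
filelist <- c("report.pdf", "photo.JPG", "notes.txt",
              "data.xlsx", "script.R", "taxls")

# Extensions to exclude (illustrative). Each one gets an escaped dot in
# front and an end-of-string anchor behind, then all are joined with "|".
exts <- c("jpg", "doc", "pdf", "xls")
pattern <- paste0("\\.", exts, "$", collapse = "|")

# grepl() returns TRUE for every name matching any alternative;
# negating it keeps only the names that match none of them.
target.files <- filelist[!grepl(pattern, filelist, ignore.case = TRUE)]
print(target.files)
```

Because of the "$" anchor, "data.xlsx" is kept here; add "xlsx" to `exts` if that type should be excluded too. The escaped dot matters as well: without it, a name such as "taxls" would match the "xls$" alternative.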

Note that the list.files function also accepts a pattern argument, so you could filter in a single step. Be aware that pattern keeps the files that match, so this returns the listed types rather than excluding them:

 my_files <- list.files(path="/path/to/dir/", 
                        pattern=paste( c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), 
                                       collapse="|") )
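
Since the pattern argument of list.files() selects matching files, excluding types still needs a negated match. A rough sketch, assuming a throwaway directory with made-up file names (`demo_dir`, `matched`, and `others` are illustrative names, not from the question):

```r
# Create a scratch directory with a few empty files (hypothetical names).
demo_dir <- tempfile("grepl_demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("a.jpg", "b.pdf", "c.txt", "d.csv"))))

pattern <- paste(c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse = "|")

# list.files(pattern = ...) returns the files that DO match the pattern.
matched <- list.files(path = demo_dir, pattern = pattern, ignore.case = TRUE)

# grep(..., invert = TRUE, value = TRUE) returns the names that do NOT match.
others <- grep(pattern, list.files(path = demo_dir),
               ignore.case = TRUE, invert = TRUE, value = TRUE)

print(matched)
print(others)
```

Here `matched` holds the jpg and pdf files, while `others` holds everything else, which is what the question asks for.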