jrzelling - 3 months ago
R Question

Use grepl() to match multiple patterns in R

This command subsets filelist to remove all "jpg" files:

filetype.isnotjpg <- setdiff(filelist, subset(filelist, grepl("\\.jpg$", filelist)))


This takes the character vector filelist, which contains the names of files from a directory. I want to return all files that are not of type "jpg", "doc", "pdf", "xls", etc., and I want to be able to specify as many types as I like to filter the list.

Ideally something like

target.files <- setdiff(filelist, subset(filelist, grepl(
    c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), filelist)))


This step-by-step approach, which filters out one pattern at a time, does what I want:

a <- setdiff(files.list, subset(files.list, grepl("\\.tmp", files.list, ignore.case = TRUE)))

a <- setdiff(a, subset(a, grepl("\\.jpg", a, ignore.case = TRUE)))
a <- setdiff(a, subset(a, grepl("\\.pdf", a, ignore.case = TRUE)))
a <- setdiff(a, subset(a, grepl("\\.tif", a, ignore.case = TRUE)))


etc. Might something like apply() work? I'm new to R, sorry.

The solution from 42- works:

target.files <- setdiff(
  files.list,
  subset(files.list,
         grepl(paste(c("\\.jpg", "\\.doc", "\\.pdf",
                       "\\.xls", "\\.tif", "\\.docx", "\\.xlsx", "\\.jpeg"),
                     collapse = "|"),
               files.list,
               ignore.case = TRUE)))
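
One caveat with unanchored patterns such as "\\.doc": they match anywhere in the name, so a file like notes.docs.txt would also be dropped. As an alternative sketch (not the approach from the answer), the extension can be compared directly using tools::file_ext(). The sample file names and the excluded vector below are made up for illustration.

```r
# Hypothetical sample data standing in for files.list
files.list <- c("report.PDF", "photo.jpg", "notes.docs.txt",
                "data.csv", "sheet.xlsx", "README")

# Extensions to drop, lower case, without the dot (illustrative list)
excluded <- c("jpg", "jpeg", "doc", "docx", "pdf", "xls", "xlsx", "tif")

# tools::file_ext() returns the text after the last dot ("" if there is none)
ext <- tolower(tools::file_ext(files.list))

# Keep only the files whose extension is not in the excluded set
target.files <- files.list[!(ext %in% excluded)]
print(target.files)
```

Lower-casing the extension plays the same role as ignore.case = TRUE in the grepl() version.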

42-
Answer

I would try paste()-ing with a collapsing separator of "|" which is the OR operator for regex:

target.files <- setdiff(filelist, subset(filelist, grepl(paste(
    c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse = "|"), filelist)))
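
To see why this works, it can help to print the collapsed pattern and try it on a small vector. The sketch below uses made-up file names. It also swaps the setdiff()/subset() combination for plain logical indexing with !grepl(), which should give the same result as long as the names are unique.

```r
# Hypothetical sample data standing in for filelist
filelist <- c("a.jpg", "b.doc", "c.pdf", "d.xls", "e.txt", "f.R")

# Collapse the individual patterns into one regex joined by "|" (OR)
pattern <- paste(c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse = "|")
cat(pattern, "\n")

# Keep the names that do NOT match any of the alternatives
target.files <- filelist[!grepl(pattern, filelist)]
print(target.files)
```

Adding ignore.case = TRUE to the grepl() call would also catch upper-case extensions such as ".JPG".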

Did you know that the list.files function also accepts a pattern argument? You could do this in a single step with something like:

 my_files <- list.files(path="/path/to/dir/", 
                        pattern=paste( c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), 
                                       collapse="|") )
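
Note that the pattern argument of list.files() keeps the files that match, so the call above returns the jpg/doc/pdf/xls files rather than excluding them. To get the complement, one option is to list everything and filter with grep(..., invert = TRUE, value = TRUE). The following sketch creates a throwaway directory with made-up file names so that it can be run as is.

```r
# Create a temporary directory with a few empty, made-up files
demo.dir <- tempfile("demo_")
dir.create(demo.dir)
invisible(file.create(file.path(demo.dir, c("a.jpg", "b.pdf", "c.txt", "d.csv"))))

pattern <- paste(c("\\.jpg$", "\\.doc$", "\\.pdf$", "\\.xls$"), collapse = "|")

# pattern= returns only the files that match
matching <- list.files(path = demo.dir, pattern = pattern)

# invert = TRUE returns the files that do not match
other <- grep(pattern, list.files(path = demo.dir), invert = TRUE, value = TRUE)

print(matching)
print(other)

# Clean up the temporary directory
unlink(demo.dir, recursive = TRUE)
```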