Jeffrey Kramer - 2 months ago
R Question

Regex to filter, then determine latest date

Say I have a directory with four files:

This is how the names are formatted. I cannot change them, and the format will remain the same except the dates will change. All of the names begin with
. Next, there is either a four-letter code (
) or a three-letter code (
). If the file name has a four-letter code, it will always have a three-letter code after it. Finally, there is a date value.

I have two tasks. First, I need to filter out the files that have the "abcd" component. This will always be a four-character code that appears after the
in the name. Is there a way to write a regex to remove these files?

That leaves two files:

I need only the file with the later date. Is there a second regex I could use to extract the dates, find the latest, and then keep only the file with that date? I'm doing this to get the file set down to four:

myDir <- "\\\\myDir\\folder\\"
files <- list.files(path = myDir, pattern = "\\.csv$")

Here's a vector with the file names if someone wants to try it out:

files <- c("", "", "", "")


Here's my attempt at a simple base R answer:

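# Drop file names that contain a four-letter field between two dots
files <- files[!grepl("^.*?\\.[[:alpha:]]{4}\\.", files)]

# Take the third dot-separated field of each remaining name (the date)
dates <- vapply(strsplit(files, ".", fixed = TRUE), "[[", character(1), 3)

# Parse as day, abbreviated month, two-digit year and keep the latest file
files[which.max(as.Date(dates, format = "%d%b%y"))]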
# [1] ""
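
Since the actual file names are not shown above, here is a minimal sketch of both steps on made-up names. The prefix "rpt", the codes "abcd" and "xyz", and the dates are all hypothetical; the only assumptions are the layout described in the question and the ddMonyy date form implied by the "%d%b%y" format string. It anchors the four-letter test to the field right after the prefix and pulls the date out with a second regex rather than by splitting on dots.

```r
# Hypothetical file names (the real ones are not shown in the post).
# Assumed layout: <prefix>.[<4-letter code>.]<3-letter code>.<ddMonyy>.csv
files <- c(
  "rpt.abcd.xyz.01Jan17.csv",
  "rpt.abcd.xyz.15Mar17.csv",
  "rpt.xyz.01Jan17.csv",
  "rpt.xyz.15Mar17.csv"
)

# Step 1: drop names whose field right after the prefix is exactly four letters.
# [^.]+ cannot cross a dot, so only the second field is tested.
keep <- files[!grepl("^[^.]+\\.[[:alpha:]]{4}\\.", files)]

# Step 2: capture the date token that sits just before ".csv".
date_txt <- sub("^.*\\.([0-9]{2}[[:alpha:]]{3}[0-9]{2})\\.csv$", "\\1", keep)

# %b is locale-dependent; this assumes English month abbreviations.
dates <- as.Date(date_txt, format = "%d%b%y")

# Keep only the file with the latest date.
latest <- keep[which.max(dates)]
latest
```

If the real prefix itself contains dots, the anchored pattern in step 1 would need adjusting, and a name that does not match the step 2 pattern is returned unchanged by sub() and then parses to NA.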