Jeffrey Kramer - 1 month ago
R Question

Regex to filter, then determine latest date

Say I have a directory with four files:

someText.abcd.xyz.10Sep16.csv
someText.xyz.10Sep16.csv
someText.abcd.xyz.23Oct16.csv
someText.xyz.23Oct16.csv


This is how the names are formatted. I cannot change them, and the format will stay the same except for the dates. Every name begins with "someText". Next comes either a four-letter code ("abcd") or a three-letter code ("xyz"). If a file name has a four-letter code, a three-letter code always follows it. Finally, there is a date value.

I have two tasks. First, I need to filter out the files that have the "abcd" component. This will always be a four-character code that appears right after "someText." in the name. Is there a way to write a regular expression that removes these files?

That leaves two files:

someText.xyz.10Sep16.csv
someText.xyz.23Oct16.csv


I need only the file with the later date. Is there a second regex I could use to extract the dates, find the latest one, and keep only that file? This is what I'm currently doing to get the file set down to the four files above:

myDir <- "\\\\myDir\\folder\\"
files <- list.files(path = myDir, pattern = "\\.csv$")


Here's a vector with the file names if someone wants to try it out:

files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv", "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")

Answer

Here's my attempt at a simple base R answer:

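# drop file names that contain a four-letter component between dots (the "abcd" files)
files <- files[!grepl("^.*?\\.[[:alpha:]]{4}\\.", files)]

# the date is now the third dot-separated component of each remaining name
dates <- unlist(lapply(strsplit(files, "\\."), "[[", 3))

# parse as day / abbreviated month / two-digit year and keep the latest file
# (note: %b depends on the current locale's month abbreviations)
files[which.max(as.Date(dates, format = "%d%b%y"))]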
# [1] "someText.xyz.23Oct16.csv"
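
If the locale dependence of %b is a concern, the same two steps can be sketched in a single helper that matches the wanted names directly and looks the month up in base R's built-in month.abb constant. This is only a sketch, not part of the answer above: the function name latest_file is hypothetical, and it assumes the layout described in the question (prefix, one three-letter code, ddMonyy date with English month abbreviations, .csv) and years in the range 2000-2099.

```r
# Hypothetical helper: returns the file without a four-letter code that has
# the latest date, or character(0) if no file name matches the layout.
latest_file <- function(files) {
  # prefix . three-letter code . ddMonyy . csv
  # [^.]+ cannot cross a dot, so names with an extra "abcd" component fail to match
  pat <- "^[^.]+\\.[[:alpha:]]{3}\\.([0-9]{2})([[:alpha:]]{3})([0-9]{2})\\.csv$"
  keep <- files[grepl(pat, files)]
  if (length(keep) == 0) return(character(0))

  # pull day, month abbreviation and two-digit year out via back-references
  day <- as.integer(sub(pat, "\\1", keep))
  mon <- match(sub(pat, "\\2", keep), month.abb)    # English abbreviations, locale-independent
  yr  <- 2000L + as.integer(sub(pat, "\\3", keep))  # assumption: years 2000-2099

  dates <- as.Date(sprintf("%04d-%02d-%02d", yr, mon, day), format = "%Y-%m-%d")
  keep[which.max(dates)]
}

files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv",
           "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")
latest_file(files)
# [1] "someText.xyz.23Oct16.csv"
```

Matching the wanted names positively, rather than negating a match on the unwanted ones, also means that any file that does not follow the expected layout is ignored instead of reaching the date parsing step.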