Jonny Stallings - 1 month ago
R Question

How to extract the last 4 digits of a string of characters in R

I would like to extract the LAST 4 digits in a given string, but can't figure out how. The LAST 4 digits could appear as "XXXX" or as "XXXX-" (followed by a trailing dash). Ultimately, I have a list of heterogeneous entries that include single years (e.g., 2001- or 2001), lists of years (e.g., 2001, 2004-), ranges of years (e.g., 2001-2010), or a combination of these, with or without a dash ("-") at the end of the entry.

I realize that '$' anchors a match to the END of the string and '^' anchors it to the START in regular expressions. I'm able to extract the FIRST 4 digits easily. Here is an example of what I'm able to do and the code that is not working for the LAST 4 digits:

library(stringr)
test <- c("2009-", "2008-2015", "2001-, 2003-2010, 2012-")
str_extract_all(test, "^[[:digit:]]{4}") # Extracts FIRST 4

[[1]]
[1] "2009"

[[2]]
[1] "2008"

[[3]]
[1] "2001"

str_extract_all(test, "[[:digit:]]{4}$") # Does not extract LAST 4

[[1]]
character(0)

[[2]]
[1] "2015"

[[3]]
character(0)

str_extract_all(test, "\\d{4}$") # Same result: fails when the string ends in "-"

[[1]]
character(0)

[[2]]
[1] "2015"

[[3]]
character(0)

The result I desire is:


[1] "2009" "2015" "2012"

Answer

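We can try with sub and a capture group. The leading ".*" is greedy, so it consumes as much of the string as it can and the group ends up capturing the last run of four digits; the trailing ".*$" absorbs whatever follows (such as the dash), and the replacement keeps only the captured group.

sub(".*(\\d{4}).*$", "\\1", test)
#[1] "2009" "2015" "2012"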
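One caveat with sub is that it returns the input unchanged when the pattern does not match, so an entry without any four-digit run would come back as-is. As an alternative, here is a minimal base R sketch that collects every four-digit match and keeps the last one per string, returning NA when there is none. The names all_years and last_year are just illustrative, and "[0-9]{4}" will also match four digits inside a longer run of digits, which I am assuming does not occur in your data.

```r
test <- c("2009-", "2008-2015", "2001-, 2003-2010, 2012-")

# Collect every non-overlapping run of four digits in each string
# (the result is a list with one character vector per input string).
all_years <- regmatches(test, gregexpr("[0-9]{4}", test))

# Keep only the last match per string; use NA when a string has no match.
last_year <- vapply(
  all_years,
  function(x) if (length(x) > 0) x[length(x)] else NA_character_,
  character(1)
)

last_year
#[1] "2009" "2015" "2012"
```

This does not depend on what comes after the final year, so it should behave the same whether an entry ends in a dash or not.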