RAS - 3 months ago
R Question

Regex to extract values between 2 underscores, including a value that is an underscore

I am working in R and trying to extract part of a character string whose fields are separated by underscores. The part I want itself contains an underscore:

WRAP_384_p1_QC1_8
WRAP_384_p3_QC1_7


I wish to obtain an output like this:

1_QC1
3_QC1


What regex do I need to extract this information?

Answer

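We can use gsub to remove two things: zero or more characters (.*) up to and including a _ followed by a lowercase letter ([a-z]), or (|) a _ followed by one or more digits (\\d+) at the end ($) of the string. Because .* is greedy, the first alternative consumes everything up to the last _ that is followed by a lowercase letter. Each match is replaced with an empty string ("").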

gsub(".*_[a-z]|_\\d+$", "", str1)
#[1] "1_QC1" "3_QC1"
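
To see what each half of the alternation removes, the same pattern can be split into two sub calls. This is only an illustrative sketch; the names step1 and step2 are made up for the example and are not part of the original answer.

```r
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")

# First alternative: drop everything up to and including
# the last "_" that is followed by a lowercase letter
step1 <- sub(".*_[a-z]", "", str1)
step1
#[1] "1_QC1_8" "3_QC1_7"

# Second alternative: drop the trailing "_" plus digits
step2 <- sub("_\\d+$", "", step1)
step2
#[1] "1_QC1" "3_QC1"
```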

Or use sub with capture groups. From the start (^) of the string, match two instances of one or more non-underscore characters followed by an underscore (([^_]+_){2}), then a lowercase letter ([a-z]). The second capture group ((...)) holds one or more digits (\\d+), a _, and one or more alphanumeric characters ([[:alnum:]]+). After the group come an underscore (_) and one or more digits (\\d+). The whole match is replaced with the second capture group (\\2).

sub("^([^_]+_){2}[a-z](\\d+_[[:alnum:]]+)_\\d+", "\\2", str1)
#[1] "1_QC1" "3_QC1"
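
A possible variation, not part of the original answer, is to extract the wanted piece directly instead of deleting what surrounds it. The sketch below assumes the same input format and uses regmatches with regexpr and perl = TRUE so that lookarounds are available; the name res is hypothetical.

```r
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")

# (?<=_[a-z])  : preceded by "_" and a lowercase letter (not part of the match)
# \\d+_[[:alnum:]]+ : digits, an underscore, then alphanumeric characters
# (?=_\\d+$)   : followed by "_" and digits at the end of the string
m <- regexpr("(?<=_[a-z])\\d+_[[:alnum:]]+(?=_\\d+$)", str1, perl = TRUE)
res <- regmatches(str1, m)
res
#[1] "1_QC1" "3_QC1"
```

Note that regmatches returns only the elements that matched, so the result can be shorter than the input if some strings do not follow the expected format.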

data

str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")