RAS RAS - 4 months ago 12
R Question

Regex to extract values between 2 underscores, including a value that is an underscore

I am working in R and trying to extract part of a character string whose fields are separated by underscores, where the part I want itself contains an underscore:


I wish to obtain an output like this:


What regex do I need to extract this information?


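We can use gsub to match zero or more characters (.*) followed by a _ and a lower case letter ([a-z]), or (|) a _ followed by one or more digits (\\d+) at the end ($) of the string, and replace every match with an empty string (""). Because .* is greedy, the first alternative consumes everything up to and including the last _ that is followed by a lower case letter.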

gsub(".*_[a-z]|_\\d+$", "", str1)
#[1] "1_QC1" "3_QC1"
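To see what each side of the | removes, the two alternatives can be applied one at a time. This is only an illustrative sketch: the intermediate names no_prefix and result are made up here, and str1 is repeated so the snippet runs on its own.

```r
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")

# First alternative: the greedy .* runs up to the last "_" that is
# followed by a lower case letter, so the prefix "WRAP_384_p" is removed
no_prefix <- sub(".*_[a-z]", "", str1)
no_prefix
#[1] "1_QC1_8" "3_QC1_7"

# Second alternative: remove the trailing "_" and the digits after it
result <- sub("_\\d+$", "", no_prefix)
result
#[1] "1_QC1" "3_QC1"
```

gsub is needed in the one-line version because both alternatives have to match within the same string; sub would stop after the first match.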

Or use sub with capture groups. From the start (^) of the string we match two instances of one or more characters that are not an underscore followed by an underscore (([^_]+_){2}, the first capture group), then a lower case letter ([a-z]). The second capture group ((...)) takes one or more digits (\\d+), a _, and one or more alphanumeric characters ([[:alnum:]]+). After the group closes we match the remaining underscore (_) and one or more digits (\\d+). The whole match is replaced with the contents of the second capture group (\\2).

sub("^([^_]+_){2}[a-z](\\d+_[[:alnum:]]+)_\\d+", "\\2", str1)
#[1] "1_QC1" "3_QC1"
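If it helps to check which part of the string lands in the second capture group, regexec and regmatches can return the groups directly instead of substituting. This is a sketch only: the names pattern, m and second_group are made up for illustration, and str1 is repeated so the snippet is self-contained.

```r
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")
pattern <- "^([^_]+_){2}[a-z](\\d+_[[:alnum:]]+)_\\d+"

# regexec() returns the positions of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(str1, regexec(pattern, str1))

# Element 1 is the full match, element 2 is group 1, element 3 is group 2
second_group <- vapply(m, function(x) x[3], character(1))
second_group
#[1] "1_QC1" "3_QC1"
```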


str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")