Bonono Bonono - 1 month ago 9
R Question

extract character string dynamically from character vector r

Here are three character strings:

[1] "Session_1/Focal_1_P1/240915_P1_S1_F1.csv"
[2] "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv"
[3] "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv"


I'm trying to extract the strings P1, PA10 and DA100, respectively, in a standardised manner (as I have several hundred other strings from which I want to extract this).

I know I need to use regex but I'm fairly new to it and not exactly sure which one.

I can see that the commonalities are 6 digits (\d\d\d\d\d\d) followed by an _ and then what I want, followed by another _.

How do I extract what I want? I believe with grep, but I'm not 100% sure of the regular expression I need.

Answer

We can use gsub. We match zero or more characters (.*) followed by a forward slash (\\/), followed by one or more digits and an underscore (\\d+_), or (|) two instances of an underscore followed by one or more characters that are not an underscore ((_[^_]+){2}), and replace each match with an empty string ("").

gsub(".*\\/\\d+_|(_[^_]+){2}", "", v1)
#[1] "P1"    "PA10"  "DA100"
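
To see what each side of the alternation removes, the same idea can be sketched as two separate sub calls. This is only an illustration: the names step1 and step2 are made up here, and the $ anchor in the second pattern is an addition that is not in the one-liner above.

```r
v1 <- c("Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
        "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
        "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

# Step 1 (left side of the |): drop everything up to and including
# the last "/" plus the digits and underscore that follow it.
step1 <- sub(".*/\\d+_", "", v1)
# step1 is now "P1_S1_F1.csv" "PA10_S2_F1.csv" "DA100_S3_F1.csv"

# Step 2 (right side of the |): drop the two trailing "_<text>" pieces.
# The "$" anchor is an assumption added for this sketch.
step2 <- sub("(_[^_]+){2}$", "", step1)
# step2 is now "P1" "PA10" "DA100"
```

Because gsub replaces every match, the single pattern with | does both removals in one pass.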

Or we extract the basename of the vector, match one or more digits followed by an underscore (\\d+_), followed by one or more characters that are not an underscore as a capture group (([^_]+)), followed by any remaining characters until the end of the string (.*), and replace the match with the backreference (\\1) to the captured group.

sub("\\d+_([^_]+).*", "\\1", basename(v1))
#[1] "P1"    "PA10"  "DA100"
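
Since the question describes the target as the piece that follows six digits and an underscore, another possible sketch is to extract the match directly instead of deleting what surrounds it. This assumes the prefix is always exactly six digits, as in the sample data. The names m and ids are illustrative only.

```r
v1 <- c("Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
        "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
        "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

# Lookbehind (?<=...) needs perl = TRUE. It asserts that six digits
# and an underscore come immediately before the match without
# including them in it; [^_]+ then takes everything up to the next "_".
m <- regexpr("(?<=\\d{6}_)[^_]+", v1, perl = TRUE)

# regmatches() returns only the matched substrings.
ids <- regmatches(v1, m)
ids
```

One thing to watch with this variant: regmatches drops elements that do not match, so the result can be shorter than the input, whereas sub returns non-matching strings unchanged.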

data

v1 <- c( "Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
       "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
       "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")