user3032689 - 9 months ago
R Question

Extract 2 parts of a string

Assume I have the following string (filename):

a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"

which is one of several parts (here, part p1)

or another one

b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"

which consists of only one part (so there is no p label)

How can I extract the identifier, which is the three letters before the _VAR
(so in case one it would be TKN, in case two it would be ZHN), PLUS the part identifier, if available?

So the result should be:

case1 : TKN_p1
case2 : ZHN

I know how to extract the first identifier, but I cannot handle the second one at the same time.

My approach so far:

sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", a)
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", b)

but this incorrectly adds .tx in the second case (the result is ZHN.tx).

Answer Source

You are not using anchors, and you match the 3 characters right after timely without checking what those characters are (. matches any character). In the second case, those 3 characters are .tx from the file extension.
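To make the problem concrete, here is a small sketch that runs the pattern from the question on both strings. The variable names pat, res_a and res_b are only for illustration.

```r
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"

# Pattern from the question: the second group takes ANY 3 characters after "timely"
pat <- ".*(.{3})_VAR29380_timely(.{3}).*"

res_a <- sub(pat, "\\1\\2", a)  # second group captures "_p1"
res_b <- sub(pat, "\\1\\2", b)  # second group captures ".tx" from ".txt"

print(c(res_a, res_b))
## [1] "TKN_p1" "ZHN.tx"
```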

I suggest

sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)


  • ^ - start of string
  • .*/ - part of string up to and including the last /
  • ([A-Z]{3}) - 3 ASCII uppercase letters captured into Group 1
  • _VAR\\d+_timely - _VAR + 1 or more digits + _timely
  • (_[^_.]+)? - an optional Group 2 capturing _ + 1 or more chars other than _ and .
  • \\. - a dot
  • [^.]* - zero or more chars other than .
  • $ - end of string.

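The replacement pattern contains backreferences to both capturing groups (\\1 and \\2), which insert the captured text into the result. When the optional Group 2 does not match, \\2 is empty, so only the three letters remain.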
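If you want to see what each group captures, one option is to inspect the match with regexec() and regmatches() from base R. This is only a sketch for checking the pattern; the names pat, m and groups are illustrative.

```r
pat <- "^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$"
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"

# regexec() returns the positions of the whole match and of each group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(a, regexec(pat, a))[[1]]

# m[1] is the whole match, m[2] is Group 1, m[3] is Group 2
groups <- m[-1]
print(groups)
## [1] "TKN" "_p1"
```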

R demo:

a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
a2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
a2
## [1] "TKN_p1"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
b2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", b)
b2
## [1] "ZHN"
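Note that sub() returns its input unchanged when the pattern does not match, so a filename with a different layout would come back as the full path. Since sub() is vectorized, you could wrap the call in a small helper that returns NA for non-matching names. The function extract_id and the third example path below are hypothetical, added only for illustration.

```r
# Hypothetical helper: returns the identifier, or NA if the name does not fit the pattern
extract_id <- function(paths) {
  pat <- "^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$"
  ifelse(grepl(pat, paths), sub(pat, "\\1\\2", paths), NA_character_)
}

files <- c("X/ZHEB100/TKN_VAR29380_timely_p1.txt",
           "X/ZHEB100/ZHN_VAR29380_timely.txt",
           "X/ZHEB100/notes.txt")  # made-up name that does not fit the pattern

ids <- extract_id(files)
print(ids)
## [1] "TKN_p1" "ZHN"    NA
```

Because the pattern starts with ^.*/, it also requires at least one / before the three letters, so a bare filename without a directory would give NA here.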