MAPK MAPK - 1 month ago 7
R Question

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector

Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

I want to remove everything before SS and everything from T0 onward (including T0).
Result I want in one line of code:

SS1G_340
SS2G_340


Code I have tried:
gsub("^.*?SS|\\T0", "", Target)

Answer Source

Try this:

gsub(".*(SS.*)T0.*","\\1",Target)

[1] "SS1G_340" "SS2G_340"

Why it works:

With regex, we can keep one part of a string and remove everything around it in two steps. Step 1 is to wrap the part we'd like to keep in parentheses, which makes it a capture group. Step 2 is to refer to that group by its number in the replacement string, since a pattern can contain several parenthesized groups. For example:

gsub(".*(SS.*)+(T0.*)","\\1",Target)

[1] "SS1G_340" "SS2G_340"

Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:

gsub(".*(SS.*)+(T0.*)","\\2",Target)

[1] "T01"  "T021"
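
As a further sketch of how numbered groups behave, both groups can be referenced in the same replacement. The " | " separator and the variable name swapped are illustrative choices only and are not part of the original answer; the pattern here uses plain (SS.*) without the +.

```r
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

# Group 1 captures from SS up to (but not including) T0.
# Group 2 captures T0 and everything after it.
# The replacement emits group 2 first, then group 1.
swapped <- sub(".*(SS.*)(T0.*)", "\\2 | \\1", Target)
print(swapped)
```

Because the pattern matches the whole string, sub() and gsub() should behave the same way here; each input is replaced exactly once.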

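The .* parts are wildcards, by the way: . matches any single character and * repeats it zero or more times, so .* matches any run of characters. If you'd like to learn more about using regex in R, here's a reference that can get you started.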
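
One detail worth keeping in mind is that .* is greedy: it takes as many characters as it can. The sketch below uses a made-up string x (not from the question) that contains two T0 markers, to show how a greedy group differs from a lazy one. The lazy form .*? is written with perl = TRUE here, on the assumption that PCRE-style syntax is acceptable for your use case.

```r
# Hypothetical input with two T0 markers, for illustration only.
x <- "id_SSabc_T0mid_T0end"

# Greedy: (SS.*) extends as far as possible, up to the last T0.
greedy <- sub(".*(SS.*)T0.*", "\\1", x)

# Lazy: (SS.*?) stops at the first T0 (PCRE syntax, hence perl = TRUE).
lazy <- sub(".*(SS.*?)T0.*", "\\1", x, perl = TRUE)

print(greedy)
print(lazy)
```

With the vectors from the question this distinction does not matter, because each string contains T0 only once; it becomes relevant when the end marker can repeat.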