Chuck Dickens - 27 days ago
R Question

R Regex remove everything after Underscore if underscore is after position 3

I have been searching for a solution for two days.

Here is a sample of what my data looks like and what I would like to achieve:

# Input
dat <- c("f__dfty","fd_fgtekg","f_glgkt_s2","f_glgkt_s3","fthssfy_s2","fthssfy_s3","h__gkdnt_s2","sedfgrtsd")
# Desired output
expected <- c("f__dfty","fd_fgtekg","f_glgkt","f_glgkt","fthssfy","fthssfy","h__gkdnt","sedfgrtsd")


I need to remove an "_" and everything after it, unless the underscore is in position 2 or 3 of the string. Not every string contains an underscore.

Thanks!

Answer Source

Brief

Not sure about length of strings, so I'll assume any length can be used.


Code


Regex

^((?:.{3})?[^_\s]+).*$

Note: ^((?:.{3})?[^_]+).*$ works just as well when each string is matched on its own. The \s is only there because my regex101 example uses multiline input, where [^_]+ would otherwise match across newlines, so I posted the pattern I used there.

Substitution

$1
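
In R, the same substitution can be sketched with base sub(). This is a minimal example rather than the only way to do it: it assumes perl = TRUE so that the non-capturing group (?:...) is handled by the PCRE engine, and it writes the backreference as "\\1", which is R's spelling of $1. The name cleaned is only illustrative.

```r
# Sample data from the question
dat <- c("f__dfty", "fd_fgtekg", "f_glgkt_s2", "f_glgkt_s3",
         "fthssfy_s2", "fthssfy_s3", "h__gkdnt_s2", "sedfgrtsd")

# Keep the first capture group and drop the rest of each string.
# perl = TRUE selects the PCRE engine; "\\1" refers to capture group 1.
cleaned <- sub("^((?:.{3})?[^_]+).*$", "\\1", dat, perl = TRUE)

print(cleaned)
```

sub() is vectorized over the character vector, so no explicit loop is needed, and the \s variant of the pattern is unnecessary here because each element is matched separately.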

Results

Input

f__dfty
fd_fgtekg
f_glgkt_s2
f_glgkt_s3
fthssfy_s2
fthssfy_s3
h__gkdnt_s2
sedfgrtsd
aaaaaaa_aaaa

Output

f__dfty
fd_fgtekg
f_glgkt
f_glgkt
fthssfy
fthssfy
h__gkdnt
sedfgrtsd
aaaaaaa

Explanation

  • Assert position at beginning of line ^
  • Capture the following
    • Optionally match any three characters (?:.{3})?
    • Match one or more characters that are not in the set _\s [^_\s]+ (the \s only prevents matches across newlines in the regex101 example; it can be removed from your code if looping through an array/list/etc.)
  • Match any character any number of times .*
  • Assert position at the end of the line $
  • Replace with first capture group $1
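
To check the steps above against concrete strings, the capture group can be inspected directly. The following is a small sketch using base regexec() and regmatches(), assuming an R version in which regexec() accepts perl = TRUE; the test strings "abc_def" and "a_b" and the names pattern, x, m and kept are made up for illustration.

```r
pattern <- "^((?:.{3})?[^_]+).*$"

# One string from the question plus two hypothetical edge cases
x <- c("f_glgkt_s2", "abc_def", "a_b")

# For each string, regmatches() returns the full match followed by
# the text of each capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# The second element of each result is capture group 1,
# i.e. the part of the string that the substitution keeps
kept <- vapply(m, function(parts) parts[2], character(1), USE.NAMES = FALSE)

print(kept)
```

For "abc_def" the optional three-character prefix is skipped, because [^_]+ would have to start on the underscore, so only "abc" is kept. The same fallback applies to a string of exactly three characters such as "a_b", which is cut to "a" even though its underscore is in position 2; if strings that short can occur in your data, they may need separate handling.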