Dee Dee - 1 month ago 9
R Question

Regex - Substitute character in a matching substring

Let's say I have the following string:

input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"


I need to replace the spaces with underscores, but only in the substrings that match a pattern. (In this case, the pattern is text enclosed between two semicolons.)

The expected output should be:

output = "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"


Any ideas how to achieve that, preferably using regular expressions, and without splitting the string?

I tried:

gsub("(.*);(.*);(.*)", "\\2", input) # Pattern matching and
gsub(" ", "_", input) # Naive gsub


I couldn't combine the two, though.

Answer

Regarding the original question:

Substitute character in a matching substring

You may do it easily with gsubfn:

> library(gsubfn)
> input = "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"
> gsubfn(";([^;]+);", function(g1) paste0(";", gsub(" ", "_", g1, fixed=TRUE), ";"), input)
[1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"

The ;([^;]+); pattern matches a ; followed by one or more characters other than ; up to the next ;, capturing the text in between. The function then replaces the spaces with underscores only inside the captured part and puts the two semicolons back.
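If adding a package is not an option, the same "match, transform, put back" idea can be sketched in base R with gregexpr and the replacement form of regmatches. This is only an illustration, not part of the answer above: the names m, result and span are made up for the example, and it uses underscores as the question asks.

```r
input <- "askl jmsp wiqp;THIS IS A MATCH; dlkasl das, fm"

# Locate every ;...; span (both semicolons are part of each match)
m <- gregexpr(";[^;]+;", input)

# Replace the spaces inside each matched span, then write the spans back in place
result <- input
regmatches(result, m) <- lapply(
  regmatches(result, m),
  function(span) gsub(" ", "_", span, fixed = TRUE)
)

result
# [1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"
```

Text outside the matched spans is never touched, so the naive gsub(" ", "_", ...) from the question becomes safe once it only sees the matches.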

Another approach is to use a \G-based PCRE regex with gsub (this needs perl=TRUE):

> p = "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
> gsub(p, "_", input, perl=TRUE)
[1] "askl jmsp wiqp;THIS_IS_A_MATCH; dlkasl das, fm"


Pattern details:

  • (?:\\G(?!\\A)|;) - a custom boundary: either the end of the previous successful match (\\G(?!\\A)) or (|) a semicolon
  • (?=[^;]*;) - a lookahead check: there must be a ; after 0+ chars other than ;
  • [^;\\s]* - 0+ chars other than ; and whitespaces
  • \\K - a match reset operator: the text matched so far is dropped from the match, so only what follows gets replaced
  • \\s - a single whitespace character (if a run of whitespace should be replaced with one replacement character, add + after it).
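Two properties of the \G pattern are easier to see on small inputs. The sketch below is only an illustration: the strings spaced and shared are made up, and the replacement is an underscore as in the question. The first call uses the \\s+ variant mentioned in the last bullet. The second shows that the lookahead does not consume the closing ;, so every stretch between two consecutive semicolons is treated as a match, whereas ;([^;]+); consumes both delimiters and therefore pairs them up.

```r
# \\s+ variant: a run of whitespace inside ;...; becomes one underscore
p_run <- "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s+"
spaced <- "a  b;ONE   TWO  THREE; c  d"   # made-up test string
collapsed <- gsub(p_run, "_", spaced, perl = TRUE)
collapsed
# [1] "a  b;ONE_TWO_THREE; c  d"

# Single-\\s pattern on a made-up string where segments share delimiters
p <- "(?:\\G(?!\\A)|;)(?=[^;]*;)[^;\\s]*\\K\\s"
shared <- ";a b; c d;e f;"
chained <- gsub(p, "_", shared, perl = TRUE)
chained
# [1] ";a_b;_c_d;e_f;"
```

If only every other stretch should be rewritten (that is, the semicolons come in opening/closing pairs), a pattern that consumes both delimiters is the safer choice for inputs like the second one.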