Daniel - 12 days ago
R Question

Split character vector at math comparison signs in R

I would like to split expressions that contain mathematical comparison operators, e.g.

unlist(strsplit("var<3", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var==5", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var>2", "(?=[=<>])", perl = TRUE))


The results are:

[1] "var" "<" "3"
[1] "var" "=" "=" "5"
[1] "var" ">" "2"


For the 2nd example above, I would like to get [1] "var" "==" "5", so the two = should be returned as a single element. How do I need to change my regular expression to achieve this? (I already tried grouping and quantifiers for "==", but nothing worked - regular expressions are not my friends...)

Answer

You may use a PCRE regex to match the substrings you need:

==|[<>]|(?:(?!==)[^<>])+

To also support !=, modify it as follows:

[!=]=|[<>]|(?:(?![=!]=)[^<>])+

See the regex demo.
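
As a quick check, the first pattern can be applied to the three inputs from the question. This is a minimal sketch; the variable names `pattern`, `exprs` and `tokens` are only illustrative.

```r
# Pattern from the answer: "==" first, then a single "<" or ">",
# then a run of other characters that does not begin a "==" sequence.
pattern <- "==|[<>]|(?:(?!==)[^<>])+"

# The three inputs from the question.
exprs <- c("var<3", "var==5", "var>2")

# gregexpr() locates every match in each string; regmatches() extracts them
# and returns a list with one character vector per input string.
tokens <- regmatches(exprs, gregexpr(pattern, exprs, perl = TRUE))
print(tokens)
# [[1]] "var" "<"  "3"
# [[2]] "var" "==" "5"
# [[3]] "var" ">"  "2"
```

Because the pattern matches the tokens instead of splitting on them, regmatches() with gregexpr() takes the place of strsplit() here.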

Details:

  • == - 2 = signs
  • | - or
  • [<>] - a < or >
  • | - or
  • (?:(?!==)[^<>])+ - 1 or more chars other than < and > ([^<>]), each checked by the negative lookahead (?!==) so that it does not start a == char sequence (a tempered greedy token).

NOTE: This is easily expandable by adding more alternatives and adjusting the tempered greedy token.
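
One possible extension along these lines, shown only as a sketch and not as part of the original answer, adds <= and >= as well as !=. The assumption is that every two-character operator ends in =, so a single alternative [!=<>]= covers them all. It has to come before [<>], otherwise <= would be split into < and =.

```r
# Assumed extension (not from the original answer): two-character operators
# "==", "!=", "<=", ">=" are tried first, then a single "<" or ">",
# then a run of other characters that does not begin "==" or "!=".
pattern_ext <- "[!=<>]=|[<>]|(?:(?![!=]=)[^<>])+"

exprs_ext <- c("a<=3", "b>=4", "c!=d", "e==f", "g<h")
tokens_ext <- regmatches(exprs_ext, gregexpr(pattern_ext, exprs_ext, perl = TRUE))
print(tokens_ext)
# [[1]] "a" "<=" "3"
# [[2]] "b" ">=" "4"
# [[3]] "c" "!=" "d"
# [[4]] "e" "==" "f"
# [[5]] "g" "<"  "h"
```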

R test:

> text <- "Text1==text2<text3><More here"
> res <- regmatches(text, gregexpr("==|[<>]|(?:(?!==)[^<>])+", text, perl=TRUE))
> res
[[1]]
[1] "Text1"     "=="        "text2"     "<"         "text3"     ">"        
[7] "<"         "More here"
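
If the input may contain spaces around the operator, the matched pieces keep that whitespace ("var " and " 5" for "var == 5"). A small wrapper can trim it. This is a sketch under that assumption; `split_comparison` is a hypothetical helper name, not something defined in the answer, and trimws() needs R 3.2.0 or later.

```r
# Hypothetical helper (not from the original answer): tokenise a single
# expression with the answer's pattern and strip whitespace from each token.
split_comparison <- function(x, pattern = "==|[<>]|(?:(?!==)[^<>])+") {
  stopifnot(is.character(x), length(x) == 1L)
  # [[1]] picks the matches for the single input string.
  tokens <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  trimws(tokens)
}

split_comparison("var == 5")
# [1] "var" "=="  "5"
```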