Esben Eickhardt - 24 days ago
R Question

R: regex from first character to the end of the string

I have strings like these:

a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"


I would like to use regular expressions in R to extract the text that follows each "-", up to the next space or comma, and thus get:

en et 'en 'et
ten ter terne


I have found a solution, but it does not feel very satisfying or elegant:

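# Split on spaces and commas
a <- unlist(strsplit(a, " |,"))
# Keep only the pieces that contain a hyphen
a <- a[grep("-", a)]
# Drop the hyphen itself
a <- gsub("-", "", a)

# Same three steps for b
b <- unlist(strsplit(b, " |,"))
b <- b[grep("-", b)]
b <- gsub("-", "", b)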


Do you have a suggestion for a more elegant one-liner that extracts all the endings I want?

Answer Source

I think you need to match a - that is not preceded by a word char (so that a hyphen inside a compound word is not matched), followed by an optional ' and then 1+ word chars. You can use

a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"
pat <- "\\B-\\K'?\\w+"
# perl=TRUE is required because \K is PCRE syntax
res_a <- regmatches(a, gregexpr(pat, a, perl=TRUE))
unlist(res_a)
## [1] "en"  "et"  "'en" "'et"
res_b <- regmatches(b, gregexpr(pat, b, perl=TRUE))
unlist(res_b)
## [1] "ten"   "ter"   "terne"


Pattern details:

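  • \\B - a non-word boundary; since - is itself a non-word char, the hyphen must be at the start of the string or preceded by another non-word char
  • - - a hyphen
  • \\K - match reset operator: discards the text matched so far (here, the hyphen) from the returned match
  • '? - an optional '
  • \\w+ - 1 or more word chars (letters, digits or _)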
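
To see how the pieces of the pattern behave together, here is a minimal sketch that wraps the same regmatches/gregexpr call in a helper and runs it on a few extra inputs. The function name extract_endings and the test strings are assumptions made up for illustration; they are not part of the original question or answer.

```r
# Hypothetical helper (name is illustrative): returns a list with one
# character vector of endings per input string.
extract_endings <- function(x) {
  pat <- "\\B-\\K'?\\w+"
  # perl = TRUE is required because \K is PCRE syntax
  regmatches(x, gregexpr(pat, x, perl = TRUE))
}

# Made-up inputs: a normal case, a compound word, and a string with no hyphen
x <- c("-en eller -et", "e-mail, -ten", "ingen")
res <- extract_endings(x)

res[[1]]  # endings after both hyphens
res[[2]]  # the hyphen inside "e-mail" is skipped because of \B
res[[3]]  # no hyphen, so an empty character vector
```

Because gregexpr is vectorised, the helper takes a whole character vector at once. Wrap the result in unlist() if a single flat vector is preferred over a list.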