Andrew Brown - 1 month ago
R Question

Regex in R select sentence ending in new line

My understanding is that R uses either extended regular expressions or Perl-like regular expressions. I have searched SO and the web for a solution to this regex problem but I have come up empty:

In R I have a vector of text files. Each element consists of a few paragraphs. I would like to extract a few sentences from each element to create a new vector with this subset of text. The sentences I would like to extract follow a predictable pattern.

text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
"AND \n \n notes: text text/text.\n \n text text \n text",
"AND \n \n house: text text/text.\n \n text text \n text")


I would like to extract all the text from "house notes", "house" or "notes" up to the first "\n" that follows. These words may appear elsewhere in the document, but I'm only interested in their first occurrence.

> output
"house notes: text text/text.\n",
"notes: text text/text.\n ",
"house: text text/text.\n "


I can get it to work in PHP with
\w++ notes: \w++ \w*+[^_]\w[^:\\]*+\\\w
but not in R.

Answer

Note two things. First, you tested against a string containing a literal \n (backslash + n) rather than a newline character. Second, your pattern uses the PCRE regex flavor (\w++ contains a possessive quantifier), so you need to pass perl=TRUE to base R regex functions to use such regexps.
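The two points above can be checked with a short sketch. The sample string x is hypothetical and only for illustration; the behaviour of the default engine on \w++ is captured rather than assumed, since it may reject the pattern.

```r
# In an R string literal, "\n" is one newline character, while "\\n" is
# two characters: a backslash followed by the letter n.
real_newline <- "\n"
literal_backslash_n <- "\\n"
n_chars <- c(nchar(real_newline), nchar(literal_backslash_n))

# Hypothetical sample string, for illustration only.
x <- "house notes: text"

# With perl = TRUE the PCRE engine is used, which supports the
# possessive quantifier in \w++.
m_pcre <- regmatches(x, regexpr("\\w++ notes:", x, perl = TRUE))

# Without perl = TRUE the default engine is used; the outcome is captured
# here (match position, error message, or warning message) instead of
# being assumed.
m_default <- tryCatch(regexpr("\\w++ notes:", x),
                      error = function(e) conditionMessage(e),
                      warning = function(w) conditionMessage(w))
```

Here n_chars holds the lengths of the two strings and m_pcre holds the text matched by the PCRE engine; inspect m_default to see how the default engine handles the same pattern on your R version.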

Since you just want to get the text from a specific string up to a newline, a simple pattern is a group of alternatives, followed by a negated character class (matching any characters except \n) and then a newline:

> text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
+           "AND \n \n notes: text text/text.\n \n text text \n text",
+           "AND \n \n house: text text/text.\n \n text text \n text")
> 
> pat = "(house( notes)?|notes):[^\n]*\n"
> regmatches(text, gregexpr(pat, text))
[[1]]
[1] "house notes: text text/text.\n"

[[2]]
[1] "notes: text text/text.\n"

[[3]]
[1] "house: text text/text.\n"

Details:

  • (house( notes)?|notes) - a group matching either house, house notes, or notes
  • : - a colon
  • [^\n]* - a negated character class matching zero or more chars other than a newline
  • \n - a newline.
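
The question asks only for the first occurrence in each element, whereas gregexpr returns every match as a list. A minimal sketch of one way to get a plain character vector of first matches is to use regexpr instead. The fourth input element and the name first_match are assumptions added here for illustration, to show what happens when an element has no match.

```r
# Sample input from the question, plus one hypothetical element with no match.
text <- c("AND \n \n house notes: text text/text.\n \n text text \n text",
          "AND \n \n notes: text text/text.\n \n text text \n text",
          "AND \n \n house: text text/text.\n \n text text \n text",
          "no keyword here")

pat <- "(house( notes)?|notes):[^\n]*\n"

# regexpr() reports only the first match in each element (-1 when there is none).
m <- regexpr(pat, text)

# regmatches() drops non-matching elements, so fill a pre-allocated
# vector to keep the result aligned with the input.
first_match <- rep(NA_character_, length(text))
first_match[m != -1] <- regmatches(text, m)
```

With this approach first_match has the same length as text, and elements without a match stay NA instead of silently disappearing.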