tover tover - 1 year ago 63
R Question

How to extract text using regular expressions with optional patterns present?

I would like to extract the text between "one: " and "two: " and between "two: " and "three: " in the string s1 "one: bla 1 two: bla2 three: bla3". However, the "two: bla2 " part is not necessarily present, so the same pattern should also work for a string s2 such as "one: bla 1 three: bla3".

I've come up with the following R code, but my attempt with the additional parentheses around the "two:..." part and the question mark doesn't work:

library(gsubfn)  # provides strapplyc

s1 <- "one: bla 1 two: bla2 three: bla3"
s2 <- "one: bla 1 three: bla3"
strapplyc(s1, "one: (.*) (two: (.*))? three: (.*)")
strapplyc(s2, "one: (.*) (two: (.*))? three: (.*)")

Answer

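One problem is that the .* after the one: is greedy, so as soon as the two: part is allowed to be skipped it will also consume the two: part and the text after it. So, for example, once the optional part can really be left out, the three groups you care about in s1 would be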

1: "bla 1 two: bla2"
2: [empty]
3: "bla3"

You could fix this by making the first asterisk non-greedy with a question mark, i.e. writing (.*?) instead of (.*).
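The difference between the greedy and the non-greedy quantifier can be checked directly. This is only a minimal sketch, not the code from the question: it uses base R's regexec and regmatches with perl = TRUE in place of strapplyc, it assumes the space has already been moved inside a non-capturing optional group, and the helper name groups is made up for illustration.

```r
s1 <- "one: bla 1 two: bla2 three: bla3"

# Hypothetical helper: return only the captured groups of the first match.
# Element 1 of the regmatches() result is the full match, so it is dropped.
groups <- function(pattern, x) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]
}

# Greedy first group: it swallows the "two:" part, the optional group is skipped.
greedy <- groups("one: (.*)(?: two: (.*))? three: (.*)", s1)

# Non-greedy first group: it stops as soon as the rest of the pattern can match.
lazy <- groups("one: (.*?)(?: two: (.*))? three: (.*)", s1)

print(greedy)  # groups: "bla 1 two: bla2", "", "bla3"
print(lazy)    # groups: "bla 1", "bla2", "bla3"
```

An optional group that does not take part in the match comes back as an empty string here, which corresponds to the [empty] entry above.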

Some other points: I think you should put the space inside the parentheses of the two: part. Otherwise, when the two: part is absent, the pattern requires two consecutive spaces between the one: and three: parts, which is why s2 fails to match.
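The effect of the space placement can be seen in isolation. The sketch below again uses base R's regexec and regmatches with perl = TRUE rather than strapplyc; the helper groups and the variable names outside and inside are hypothetical, chosen only for this illustration.

```r
s1 <- "one: bla 1 two: bla2 three: bla3"
s2 <- "one: bla 1 three: bla3"

# Hypothetical helper: captured groups of the first match, or character(0)
# when the pattern does not match at all.
groups <- function(pattern, x) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]
}

outside <- "one: (.*) (two: (.*))? three: (.*)"  # pattern from the question
inside  <- "one: (.*)( two: (.*))? three: (.*)"  # space moved into the group

r_outside_s1 <- groups(outside, s1)  # "bla 1", "two: bla2", "bla2", "bla3"
r_outside_s2 <- groups(outside, s2)  # character(0): s2 has no double space
r_inside_s2  <- groups(inside, s2)   # "bla 1", "", "", "bla3"

print(r_outside_s1)
print(r_outside_s2)
print(r_inside_s2)
```

With the space outside the parentheses the pattern still matches s1, because there the optional part is present; it is only the string without the two: part that cannot match.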

Additionally, as a minor tidy-up, you could make the parentheses around the optional part non-capturing with ?:. You only want to capture three things, and the parentheses around the two: part are only there for grouping, so there is no need to capture them.
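To illustrate what ?: changes, the sketch below compares the captured groups with and without it. It is written with base R's regexec and regmatches (perl = TRUE) instead of strapplyc, and the helper groups is a hypothetical name used only here.

```r
s1 <- "one: bla 1 two: bla2 three: bla3"

# Hypothetical helper: captured groups of the first match (full match dropped).
groups <- function(pattern, x) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]
}

# Plain parentheses around the optional part add an extra group to the result.
capturing <- groups("one: (.*?)( two: (.*))? three: (.*)", s1)

# (?: ... ) groups without capturing, so only the three wanted values remain.
noncapturing <- groups("one: (.*?)(?: two: (.*))? three: (.*)", s1)

print(capturing)     # "bla 1", " two: bla2", "bla2", "bla3"
print(noncapturing)  # "bla 1", "bla2", "bla3"
```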

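So altogether you would have something like this (the trailing $ anchors the match to the end of the string):

strapplyc(s1, "one: (.*?)(?: two: (.*))? three: (.*)$")
strapplyc(s2, "one: (.*?)(?: two: (.*))? three: (.*)$")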
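For several strings at once, the same idea can be sketched in base R without the gsubfn package. This assumes the last group is written as (.*) and anchored with $, uses regexec and regmatches with perl = TRUE as a stand-in for strapplyc, and the column names one, two and three are chosen here purely for illustration.

```r
x <- c("one: bla 1 two: bla2 three: bla3",
       "one: bla 1 three: bla3")

pattern <- "one: (.*?)(?: two: (.*))? three: (.*)$"

# One list element per input string: full match followed by the three groups.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match and stack the groups into a character matrix.
# This assumes every string matches; a non-matching string would contribute
# an empty vector and therefore no row.
res <- do.call(rbind, lapply(m, function(g) g[-1]))
colnames(res) <- c("one", "two", "three")

print(res)
# row 1: "bla 1", "bla2", "bla3"
# row 2: "bla 1", "",     "bla3"
```

The missing two: part shows up as an empty string in the second row, so the result keeps the same three columns for both inputs.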