I would like to extract the text between "one: " and "two: " and between "two: " and "three: " in a string such as s1, "one: bla 1 two: bla2 three: bla3". However, the "two: bla2 " part is not necessarily present, as in s2, "one: bla 1 three: bla3", and the extraction should work there as well.
I've come up with the following R code, but my attempt with the additional parentheses around "two: ..." and the question mark doesn't work:
library(gsubfn)  # provides strapplyc()
s1 <- "one: bla 1 two: bla2 three: bla3"
s2 <- "one: bla 1 three: bla3"
strapplyc(s1, "one: (.*) (two: (.*))? three: (.*)")
strapplyc(s2, "one: (.*) (two: (.*))? three: (.*)")
Perhaps the problem is that the .* after the one: is also consuming the two: part and the text after it. So, for example, the matching groups in your line would be
1: "bla 1 two: bla2" 2: [empty] 3: "bla3"
You could fix this by making the first asterisk non-greedy with a question mark.
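To see the difference, here is a minimal sketch in base R rather than gsubfn, assuming the PCRE engine (perl = TRUE). The helper name first_group and the shortened pattern are only for illustration; they are not part of the original code.

```r
s1 <- "one: bla 1 two: bla2 three: bla3"

# Hypothetical helper: return the first capture group of a PCRE match.
first_group <- function(pattern, x) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][2]
}

# Greedy .* swallows the optional "two: ..." part as well.
greedy <- first_group("one: (.*)( two: .*)? three:", s1)

# Lazy .*? stops as soon as the rest of the pattern can match.
lazy <- first_group("one: (.*?)( two: .*)? three:", s1)

print(greedy)  # "bla 1 two: bla2"
print(lazy)    # "bla 1"
```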
Some other points: I think you should put the space inside the parentheses of the two: part. Otherwise, when that part is absent, the pattern requires two spaces between the text after one: and three:.
Additionally, as a minor tidy-up, you could make the parentheses around the optional part non-capturing with ?:. You only want to capture three things, and the parentheses around the two: part are there just for precedence, so they don't need to capture anything.
So altogether you would have something like this:
strapplyc(s1, "one: (.*?)(?: two: (.*))? three: (.*)")
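If you want to check the idea without gsubfn, the following is a rough base R sketch under the assumption that PCRE (perl = TRUE) is acceptable. The function name extract_parts is made up for this example, and the pattern is a slight variant that also makes the middle group lazy.

```r
# Hypothetical helper: return the three captured fields as a character vector.
extract_parts <- function(x) {
  # Lazy first group, optional non-capturing "two:" part with the space inside.
  pattern <- "one: (.*?)(?: two: (.*?))? three: (.*)"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  m[-1]  # drop the full match, keep only the capture groups
}

s1 <- "one: bla 1 two: bla2 three: bla3"
s2 <- "one: bla 1 three: bla3"

parts1 <- extract_parts(s1)
parts2 <- extract_parts(s2)

print(parts1)  # "bla 1" "bla2" "bla3"
print(parts2)  # "bla 1" ""     "bla3"
```

When the two: part is missing, its group comes back as an empty string, so the result always has three elements.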