I'm pretty sure I'm missing something obvious here, but I cannot make R to use non-greedy regular expressions:
> str_match('xxx aaaab yyy', "a.*?b")
> regexpr('a.*?b', 'xxx aaaab yyy')
By default repetition is greedy, so the maximal possible number of repeats is used.
This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further
quantifiers that allow approximate matching: see the TRE documentation.)
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
[1,] "<a href=\"abc\">"
Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.
Expressions that match your patterns are searched from left to right. Yes, all of the following strings
ab are matches to your pattern, but
aaaab being the one that starts the most to the left is the one that is returned.
So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:
str_match('xxx aaaab yyy', "a.*?y") # [,1] # [1,] "aaaab y"
Here all of the strings
aaaab yyy matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.
So what can you do to catch that last
ab? Use this:
str_match('xxx aaaab yyy', ".*(a.*b)") # [,1] [,2] # [1,] "xxx aaaab" "ab"
How does it work? By adding a greedy pattern
.* in the front, you are now forcing the process to put the last possible
a into the captured group.