Ismael Miguel - 2 months ago
PHP Question

Regex to match up to 2 full words and the next word containing the character

I've developed the following regular expression to use in a search field.

The goal is to match up to 2 words before the search term, then the full word containing the searched character(s), and everything after it:

/^
.*? # match anything before, as few times as possible
(
(?:
[^\s]+\s* # anything followed by whitespace
){1,2} # match once or twice
\s*? # match whitespaces that may be left behind, just in case
[^\s]*? # match the beginning of the word, if exists
)?
(foo|bar) # search term(s)
([^\s]*\s*.*) # whatever is after, with whitespace, if it is the end of the word
$/xi


The problem is that it isn't always matching correctly.

A few examples, when searching for "a":

Fantastic drinks and amazing cakes

Expected match:
$1 = F
$2 = a
$3 = ntastic drinks and amazing cakes

Result:
$1 = Fantastic drinks (space)
$2 = a
$3 = nd amazing cakes

-----------------------------------------

Drinks and party!

Expected match:
$1 = Drinks (space)
$2 = a
$3 = nd party!

Result:
$1 = Drinks and p
$2 = a
$3 = rty!

------------------------------------------

Drinks will be served at the caffetary in 5 minutes

Expected match:
$1 = be served (space)
$2 = a
$3 = t the caffetary in 5 minutes

Result (matches correctly):
$1 = be served (space)
$2 = a
$3 = t the caffetary in 5 minutes
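
To reproduce these results outside regex101, the pattern can be run through PHP's preg_match. This is only a sketch: the $samples and $results names are illustrative, and the search term is hard-coded to "a" as in the examples above.

```php
<?php
// Original pattern from the question, with the search term fixed to "a".
// The x flag permits whitespace and # comments; i makes it case-insensitive.
$pattern = <<<'REGEX'
/^
.*?                  # match anything before, as few times as possible
(
    (?:
        [^\s]+\s*    # anything followed by whitespace
    ){1,2}           # match once or twice
    \s*?             # match whitespace that may be left behind
    [^\s]*?          # match the beginning of the word, if it exists
)?
(a)                  # search term
([^\s]*\s*.*)        # whatever is after
$/xi
REGEX;

$samples = [
    'Fantastic drinks and amazing cakes',
    'Drinks and party!',
    'Drinks will be served at the caffetary in 5 minutes',
];

$results = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // Keep only the three capture groups.
        $results[$sample] = [$m[1], $m[2], $m[3]];
    }
}

foreach ($results as $sample => $groups) {
    // Brackets make trailing spaces in $1 visible.
    printf("%s\n  \$1 = [%s]\n  \$2 = [%s]\n  \$3 = [%s]\n", $sample, ...$groups);
}
```

The first sample yields $1 = "Fantastic drinks " because the greedy {1,2} consumes two whole words before the engine ever tries the search term.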


You can experiment with it on https://regex101.com/r/cI7gZ3/1 with unit tests included.

The way this fails is strange, beyond what I can describe. My guess is that it prefers matches that have 1-2 words before the search term.

What is wrong here? What is causing these issues?

Answer

I suggest using lazy versions of \S+ and {1,2} in

(?: 
    \S+?\s* # anything followed by whitespace
){1,2}?

and removing the [^\s]*? part (the one commented "match the beginning of the word, if exists").
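
The difference between the greedy and lazy forms can be seen in isolation. The following is a minimal sketch, assuming the search term "a" and a simplified pattern that keeps only the word group and the term; the variable names are illustrative.

```php
<?php
$subject = 'Fantastic drinks and amazing cakes';

// Greedy: the group grabs as many words as it can (up to two) before "a".
preg_match('/^((?:\S+\s*){1,2})(a)/i', $subject, $greedy);

// Lazy: \S+? and {1,2}? stop as soon as the following "a" can match.
preg_match('/^((?:\S+?\s*){1,2}?)(a)/i', $subject, $lazy);

var_dump($greedy[1]); // string(17) "Fantastic drinks "
var_dump($lazy[1]);   // string(1) "F"
```

With the greedy quantifiers the engine only gives characters back when the rest of the pattern fails, so the first "a" it accepts is the one in "and". The lazy quantifiers try the search term after every single character, so the "a" in "Fantastic" wins.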

See the updated regex demo:

^
  .*? # match anything before, as few times as possible
  (
    (?: 
      \S*?\s* # anything followed by whitespace
    ){1,2}?
    \s* # just in case there's whitespace
  )?
  (a) # search term(s)
  (\S*\s*.*) # whatever is after, without whitespace if it is the end of the word
$
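
As a rough check, the updated pattern can be run against the three sample strings from the question. This sketch assumes the / delimiters and the xi flags from the original pattern (the snippet above shows neither), and the $samples and $results names are illustrative.

```php
<?php
// Updated pattern, wrapped in delimiters with the x and i flags (assumed).
$pattern = <<<'REGEX'
/^
  .*?              # match anything before, as few times as possible
  (
    (?:
      \S*?\s*      # anything followed by whitespace
    ){1,2}?
    \s*            # just in case there's whitespace
  )?
  (a)              # search term(s)
  (\S*\s*.*)       # whatever is after
$/xi
REGEX;

$samples = [
    'Fantastic drinks and amazing cakes',
    'Drinks and party!',
    'Drinks will be served at the caffetary in 5 minutes',
];

$results = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        $results[$sample] = [$m[1], $m[2], $m[3]];
    }
}

foreach ($results as $sample => $groups) {
    // Brackets make trailing spaces in $1 visible.
    printf("%s\n  \$1 = [%s]\n  \$2 = [%s]\n  \$3 = [%s]\n", $sample, ...$groups);
}
// $1 comes out as "F", "Drinks " and "be served " respectively,
// which are the expected matches listed in the question.
```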