Lord Stiltskin - 6 months ago
PHP Question

Regex (preg_match in PHP): last groups in the output array don't work correctly

With this pattern:

(how is\s)?(the\s)?(weather)\s?((on)\s)?(today|tomorrow|sunday|monday|tuesday|wednesday|thursday|friday|saturday|sunday|this week)?(\s(in)\s(.*)\s?(on)?\s?(today|tomorrow|sunday|monday|tuesday|wednesday|thursday|friday|saturday|sunday|this week)?)?


This is what I'm trying to capture:

Input:
how is the weather on tuesday in vienna


Output:

array(10
0 => how is the weather on tuesday in vienna
1 => how is
2 => the
3 => weather
4 => on
5 => on
6 => tuesday
7 => in vienna
8 => in
9 => vienna
)


Here, I can extract the day and location from array[6] and array[9].


Input:
how is the weather in vienna on tuesday


Output:

array(10
0 => how is the weather in vienna on tuesday
1 => how is
2 => the
3 => weather
4 =>
5 =>
6 =>
7 => in vienna on tuesday
8 => in
9 => vienna on tuesday
)


But here, the location and day are captured together in array[9]. I want the day and location captured in separate elements. Is there anything wrong with the grouping in the regex pattern?

Answer

Description

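I recommend using two lookaheads anchored at the start of the string, each wrapping an optional non-capturing group, so that the timeframe and the location are each searched for independently and captured if they exist.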

^(?=(?:.*?on\s(today|tomorrow|sunday|monday|tuesday|wednesday|thursday|friday|saturday|sunday|this week))?)(?=(?:.*?in\s([a-z]+))?)

This regular expression will do the following:

  • capture group 1 always gets the timeframe if it exists in the string
  • capture group 2 always gets the location if it exists in the string
  • allows the location and timeframe to appear in any order in the string
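
As a rough sketch of how the pattern might be used from PHP: the helper name parseWeatherQuery, the 'day' and 'location' array keys, and the i flag are my own assumptions for illustration and not part of the pattern above. Note that preg_match leaves trailing unmatched groups out of the result array and returns an empty string for unmatched groups in the middle, so the sketch normalises both cases to null.

```php
<?php
// Hypothetical helper (name and return shape are assumptions):
// returns ['day' => ?string, 'location' => ?string] for a query string.
function parseWeatherQuery(string $text): array
{
    // Same pattern as above; the trailing "i" flag is an added assumption
    // so that mixed-case input is accepted as well.
    $pattern = '/^(?=(?:.*?on\s(today|tomorrow|sunday|monday|tuesday|wednesday|thursday|friday|saturday|sunday|this week))?)(?=(?:.*?in\s([a-z]+))?)/i';

    preg_match($pattern, $text, $m);

    // Unmatched groups are either missing (trailing) or '' (in the middle).
    $day      = $m[1] ?? '';
    $location = $m[2] ?? '';

    return [
        'day'      => $day !== '' ? $day : null,
        'location' => $location !== '' ? $location : null,
    ];
}

// Both word orders from the question give day "tuesday" and location "vienna".
print_r(parseWeatherQuery('how is the weather on tuesday in vienna'));
print_r(parseWeatherQuery('how is the weather in vienna on tuesday'));
```

Because both lookaheads start from the beginning of the string and consume nothing, neither one depends on where the other found its match, which is what makes the order of "on ..." and "in ..." irrelevant.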

Example

Live Demo

https://regex101.com/r/rN9hG2/1

Sample text

weather on sunday
weather on sunday in vienna
weather in vienna
weather in vienna on sunday

Sample Matches

[0][1] = sunday
[0][2] = 

[1][1] = sunday
[1][2] = vienna

[2][1] = 
[2][2] = vienna

[3][1] = sunday
[3][2] = vienna
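
The listing above should be reproducible with something like the following sketch. It assumes the sample text is one string with one query per line and that the m (multiline) flag is used so ^ matches at the start of each line; the helper name extractPerLine is hypothetical.

```php
<?php
// Hypothetical helper: returns one [timeframe, location] pair per line.
function extractPerLine(string $text): array
{
    // Same pattern as above, with the "m" flag (an assumption) so that
    // ^ anchors at the start of every line rather than only the string.
    $pattern = '/^(?=(?:.*?on\s(today|tomorrow|sunday|monday|tuesday|wednesday|thursday|friday|saturday|sunday|this week))?)(?=(?:.*?in\s([a-z]+))?)/m';

    // PREG_SET_ORDER groups the results per match, so $set[1] and $set[2]
    // line up with [n][1] and [n][2] in the listing.
    preg_match_all($pattern, $text, $sets, PREG_SET_ORDER);

    $pairs = [];
    foreach ($sets as $set) {
        // Trailing unmatched groups are omitted, so default them to ''.
        $pairs[] = [$set[1] ?? '', $set[2] ?? ''];
    }
    return $pairs;
}

$sample = "weather on sunday\n"
        . "weather on sunday in vienna\n"
        . "weather in vienna\n"
        . "weather in vienna on sunday";

foreach (extractPerLine($sample) as $n => [$day, $location]) {
    printf("[%d][1] = %s\n[%d][2] = %s\n\n", $n, $day, $n, $location);
}
```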

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      .*?                      any character except \n (0 or more
                               times (matching the least amount
                               possible))
----------------------------------------------------------------------
      on                       'on'
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
      (                        group and capture to \1:
----------------------------------------------------------------------
        today                    'today'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        tomorrow                 'tomorrow'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        sunday                   'sunday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        monday                   'monday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        tuesday                  'tuesday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        wednesday                'wednesday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        thursday                 'thursday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        friday                   'friday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        saturday                 'saturday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        sunday                   'sunday'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        this week                'this week'
----------------------------------------------------------------------
      )                        end of \1
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      .*?                      any character except \n (0 or more
                               times (matching the least amount
                               possible))
----------------------------------------------------------------------
      in                       'in'
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
      (                        group and capture to \2:
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of \2
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------