B_s B_s - 3 months ago 11
PHP Question

Matching specific character if it is between two digits with regex

For some data processing I need to split a string into multiple items.
An example of an input string is:

'one, two & three and four-five 123-456'


Now, I need to separate this string into items, where the possible delimiters are ",", "&", " " (space), "and", and "-". But, and this is the point where I'm stuck, it should not split on a "-" when it is between two numbers.

I am using PHP and preg_split to do the actual splitting, but I need a regex pattern that matches the delimiters while excluding the delimiter - when it is between two numbers (single digits, but it could also be longer numbers such as 123-456). Stripping the spaces around each item is done with trim() in PHP.

I am using the following regex pattern:

/(and|,|\s|&)|\D(-)\D/


The output (after using preg_split, etc.) is:

[0] => one
[1] => two
[2] => three
[3] => fou
[4] => ive
[5] => 123-456


The splitting itself works, but the pattern also swallows the last and first letters of the text surrounding the - delimiter. The item 123-456 is correct, since the pattern should not match (and preg_split should not split) on a - when it is immediately surrounded by numbers.

Expected output is:

[0] => one
[1] => two
[2] => three
[3] => four
[4] => five
[5] => 123-456


Any help is appreciated. If any information is lacking, let me know and I'll update my question.

Answer

What you want to use is lookahead and lookbehind (more generally known as lookaround):

/and|,|\s|&|(?<!\d)-(?!\d)/

What this does is exactly what the name implies: it looks around to check whether the specified pattern matches, without consuming any characters. In this case, it'll only match a - that has no numeric character (the \d) immediately before or after it, and the match will be only the - itself.
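To make the "without consuming" part concrete, here is a small sketch that compares the dash part of the pattern from the question with the lookaround version. The variable names are only for illustration.

```php
<?php
// Sample input from the question.
$input = 'one, two & three and four-five 123-456';

// \D-\D consumes the character on each side of the dash,
// so those characters become part of the delimiter.
preg_match_all('/\D-\D/', $input, $consuming);

// The lookarounds only assert what is (not) next to the dash;
// the matched text is the dash alone.
preg_match_all('/(?<!\d)-(?!\d)/', $input, $lookaround);

print_r($consuming[0]);  // one match: "r-f"
print_r($lookaround[0]); // one match: "-"
```

Neither pattern matches the dash in 123-456, but only the lookaround version leaves the letters of four and five untouched.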

In this case, (?<!\d) is a negative lookbehind - it looks backwards to check that the immediately preceding character does not match the pattern. If it does match, the attempt fails at that position and the regex engine moves on. Likewise, (?!\d) is a negative lookahead - it does precisely the same thing, but in the opposite direction. Because the - is sandwiched between them, the effect is "match a - only if there is no numeric character directly before it and none directly after it". Note that this also leaves a - alone when only one side is a digit, as in a-1.
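As a rough sketch of how this could be wired into preg_split: the helper name splitItems and the second pattern are assumptions for illustration, not something from the question or the answer. The second pattern is one possible variant if the dash should be kept only when digits sit on both sides.

```php
<?php
// Hypothetical helper for illustration; not part of the question or answer.
function splitItems(string $input, string $pattern): array
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings that adjacent
    // delimiters such as ", " or " & " would otherwise produce.
    $parts = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('trim', $parts);
}

// Pattern from the answer: a dash with a digit on either side is kept.
$answerPattern = '/and|,|\s|&|(?<!\d)-(?!\d)/';

// Assumed variant: keep the dash only when digits sit on BOTH sides.
$strictPattern = '/and|,|\s|&|(?<!\d)-|-(?!\d)/';

print_r(splitItems('one, two & three and four-five 123-456', $answerPattern));
// one, two, three, four, five, 123-456

print_r(splitItems('a-1 1-b 2-3', $answerPattern));
// a-1, 1-b, 2-3

print_r(splitItems('a-1 1-b 2-3', $strictPattern));
// a, 1, 1, b, 2-3
```

One thing to keep in mind with either pattern: the bare and alternative also matches inside longer words (for example "sand"), so \band\b may be worth considering if the input can contain such words.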