Jordan - 1 year ago
Perl Question

Perl phone-number regex

Sorry for asking such a simple question; I'm still an inexperienced programmer. I stumbled across a phone-number-matching regex in some old Perl code at work, and I'd love it if somebody could explain exactly what it means (my regex skills are severely lacking).

if ($value !~ /^\+[[:space:]]*[0-9][0-9.[:space:]-]*(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?$/i) {

Thank you in advance :)

Answer

The code roughly says "you should replace this with Number::Phone".

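All joking and good advice aside, the first thing to do when figuring out a regex is to expand it with /x, which makes Perl ignore whitespace in the pattern so you can space it out. The first pass is to break things up by capture group. (I've left off the ^ and $ anchors and the /i flag here; they just pin the match to the whole string and make "ext" case-insensitive.)

 \+[[:space:]]*[0-9][0-9.[:space:]-]*
 (\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?
 ([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?
 ([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?
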
Then, since this is dominated by character sets, I'd space by character sets.

 \+ [[:space:]]* [0-9] [0-9.[:space:]-]*
 ( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
 ( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
 ( [[:space:]]+ ext . [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
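
If you want to convince yourself that spacing the pattern out doesn't change what it matches, here is a minimal sketch that runs the original one-liner and the /x version side by side. The sample strings are made up for illustration, and the anchors and /i flag are copied from the regex in the question.

```perl
use strict;
use warnings;

# The original one-line regex from the question.
my $original = qr/^\+[[:space:]]*[0-9][0-9.[:space:]-]*(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?$/i;

# The same regex spaced out; /x makes Perl ignore the whitespace in the pattern.
my $expanded = qr{
    ^
    \+ [[:space:]]* [0-9] [0-9.[:space:]-]*
    ( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
    ( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
    ( [[:space:]]+ ext . [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
    $
}xi;

# Hypothetical sample inputs, chosen only for illustration.
my @samples = ('+1', '+1 (555) 123-4567', '+44 20 7946 0958 ext. 12',
               '(555) 123-4567', 'hello');

# Count the samples on which the two regexes disagree.
my $mismatches = 0;
for my $sample (@samples) {
    my $old = $sample =~ $original ? 1 : 0;
    my $new = $sample =~ $expanded ? 1 : 0;
    $mismatches++ if $old != $new;
}
print "mismatches: $mismatches\n";
```

Since the two patterns differ only in whitespace, this should report no mismatches.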

Now you can start to see some similar elements. Try lining those up to see the similarities.

 \+        [[:space:]]* [0-9] [0-9.[:space:]-]*
 ( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
 (    [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]*    )?
 ( [[:space:]]+
   ext .
      [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]*
 )?

Then zero in on an element and try to figure it out. The important one is [0-9.[:space:]-]*, meaning "zero or more numbers, spaces, dashes or dots". On its own that doesn't make much sense for phone parsing, but maybe it will make more sense in context. Let's look at a line where we can guess what it's trying to do.

( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
  • Open paren.
  • Zero or more numbers, spaces, dashes or dots.
  • A number
  • Zero or more numbers, spaces, dashes or dots.
  • Close paren.

The parens suggest this is trying to parse an area code. The rest limits it to any number of numbers, spaces, dashes or dots, but the [0-9] ensures there is at least one number. This is likely the author's way of dealing with the multitude of phone number formats.
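
A quick way to check that reading is to pull the group out and try it on a few strings. This is only a sketch: the name $area_code and the inputs are mine, and I've anchored the group with ^ and $ so it has to match the whole string.

```perl
use strict;
use warnings;

# The area-code group on its own, anchored so it must match the whole string.
# $area_code is a name made up for this example.
my $area_code = qr{
    ^
    \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \)
    $
}x;

# Hypothetical inputs: the first three contain at least one digit,
# the last two contain none.
for my $input ('(555)', '( 5 )', '(5-5.5)', '()', '(- .)') {
    printf "%-8s %s\n", $input,
        $input =~ $area_code ? 'matches' : 'does not match';
}
```

The strings with at least one digit between the parens match no matter how much punctuation surrounds it, while "()" and "(- .)" do not.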

Let's give the character class [0-9.[:space:]-] a name and call it phone_chars, because it's what the author has decided phone numbers are made of. There's another recurring element, [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]*, which I'll call a "phone atom" because it's what the author decided an atom of a phone number can be. If we put each of those in its own regex and build the phone regex from them, things become a lot clearer.

my $phone_chars = qr{[0-9.[:space:]-]};
my $phone_atom  = qr{$phone_chars* [0-9] $phone_chars*}x;

 \+ [[:space:]]* [0-9] $phone_chars*
 ( \( $phone_atom \) )?
 (    $phone_atom    )?
 ( [[:space:]]+ ext . $phone_atom )?
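
Put together as something you can run, it might look like the sketch below. The name $phone_re and the test numbers are mine; the ^ and $ anchors and the /i flag are carried over from the original regex in the question.

```perl
use strict;
use warnings;

my $phone_chars = qr{[0-9.[:space:]-]};
my $phone_atom  = qr{$phone_chars* [0-9] $phone_chars*}x;

# $phone_re is a name introduced here for illustration.
my $phone_re = qr{
    ^
    \+ [[:space:]]* [0-9] $phone_chars*
    ( \( $phone_atom \) )?
    (    $phone_atom    )?
    ( [[:space:]]+ ext . $phone_atom )?
    $
}xi;

# Hypothetical numbers; the last one has no leading + and should be rejected.
for my $number ('+1 (555) 123-4567', '+1 555 123 4567 ext. 89', '1 555 123 4567') {
    printf "%-26s %s\n", $number, $number =~ $phone_re ? 'valid' : 'invalid';
}
```

Each qr// object keeps its own flags when it is interpolated, so $phone_atom still ignores its internal whitespace inside the bigger pattern.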

If you know something about phone numbers, the four lines read like this:

  1. Mandatory country code (which must start with a + and a number)
  2. Optional area code
  3. Optional phone number
  4. Optional extension

This regex doesn't do a very good job of validating phone numbers. According to it, "+1" is a valid phone number, but "(555) 123-4567" isn't, because it doesn't have a country code.
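
You can see this for yourself with a small sketch. is_valid_phone is a hypothetical helper that wraps the regex from the question unchanged, and the inputs are made up. The last input is there because the . after "ext" is unescaped, so it matches any character, not just a literal dot.

```perl
use strict;
use warnings;

# is_valid_phone is a hypothetical helper wrapping the regex from the
# question unchanged. Returns 1 if the value passes, 0 otherwise.
sub is_valid_phone {
    my ($value) = @_;
    return $value =~ /^\+[[:space:]]*[0-9][0-9.[:space:]-]*(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?$/i ? 1 : 0;
}

# Hypothetical inputs, chosen only for illustration.
printf "%-16s %s\n", $_, is_valid_phone($_) ? 'accepted' : 'rejected'
    for '+1', '(555) 123-4567', '+1 extX5';
```

It accepts "+1" and "+1 extX5" and rejects "(555) 123-4567".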

Phone number validation is hard. Did I mention Number::Phone?