Beano Beano - 5 months ago 20
Perl Question

Should I use \d or [0-9] to match digits in a Perl regex?

Having read a number of questions and answers over the past few weeks, I have seen the use of \d in Perl regular expressions commented on as incorrect. In later versions of Perl, \d is not the same as [0-9]: \d matches any Unicode character that has the digit attribute, whereas [0-9] matches only the characters '0', '1', '2', ..., '9'.

I appreciate that in some contexts [0-9] will be the correct thing to use, and in others \d will be. Which do people feel is the correct default?

Personally, I find the \d notation very succinct and expressive, whereas [0-9] is somewhat cumbersome by comparison. But I have little experience of writing multi-language code, or rather code for languages that do not fit into the ASCII character range, and may therefore be naive.

I notice that \d is far more common than [0-9] in the modules shipped with Perl 5.8.8:

$ find /System/Library/Perl/5.8.8/ -name \*pm | xargs grep '\\d' | wc -l
298
$ find /System/Library/Perl/5.8.8/ -name \*pm | xargs grep '\[0-9\]' | wc -l
26

Answer

For maximum safety, I'd suggest using [0-9] any time you don't specifically intend to match all Unicode-defined digits.

Per perldoc perluniintro, Perl does not support using digits other than [0-9] as numbers, so I would definitely use [0-9] if the following are both true:

  1. You want to use the result as a number, such as performing mathematical operations on it or storing it somewhere that only accepts proper numbers (e.g. an INT column in a database).

  2. It is possible that digits other than [0-9] (characters that \d matches but [^0-9] also matches) would be present in the data in such a way that the regular expression could match them. (Note that this should always be considered true for untrusted or hostile input.)

If either of these is false, there will only rarely be a reason to avoid \d (and you'll probably be able to tell when that is the case). And if you're trying to match all Unicode-defined digits, you'll definitely want to use \d.
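
As a rough illustration of the two conditions above, here is a minimal sketch comparing \d and [0-9] on an ASCII string and on a string of Arabic-Indic digits (U+0661 to U+0663). The helper names is_unicode_digits, is_ascii_digits and as_number are hypothetical and exist only for this example; the non-ASCII string is written with \x{...} escapes so the source stays pure ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper names, chosen for this illustration only.
sub is_unicode_digits { my ($s) = @_; return $s =~ /\A\d+\z/    ? 1 : 0 }
sub is_ascii_digits   { my ($s) = @_; return $s =~ /\A[0-9]+\z/ ? 1 : 0 }

# Numeric value Perl assigns to a string. The "isn't numeric"
# warning is silenced here so the demonstration stays quiet.
sub as_number {
    my ($s) = @_;
    no warnings 'numeric';
    return $s + 0;
}

my @samples = (
    [ 'ASCII 123'        => "123" ],
    [ 'Arabic-Indic 123' => "\x{0661}\x{0662}\x{0663}" ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    printf "%-17s \\d+: %d  [0-9]+: %d  numeric value: %d\n",
        $label, is_unicode_digits($text), is_ascii_digits($text), as_number($text);
}
```

For the Arabic-Indic sample, \d+ matches and [0-9]+ does not, yet the numeric value Perl derives from the string is 0. That is the failure mode to watch for when a \d match on untrusted input is later used as a number.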