n.r. n.r. - 4 months ago 8
Perl Question

Why does [^\w] match some word characters but not [^\p{Word}]?

I've written a Perl script that prints out characters matching a Unicode property. It seems to work all right for most properties so far.

But it prints out

ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
ÿ
among characters matching
[^\w]
. These characters should rather match
\w
. Strangely enough, they match
\p{Word}
.

I've tried without success:


  • map { decode ( "UTF-8", $_ ) }

  • map { pack 'U0C*', unpack 'C*', $_ }



How can I make
[^\w]
not match those word characters?

chars.pl



#!/usr/bin/perl

use warnings;
use strict;
use utf8;

binmode STDOUT, ':utf8';

my $c;
my $cols = 80;
my $arg = shift;
my $regex = qr/$arg/;

for ( map { chr } 0x20 .. 0xFFFF )
{
next if /\p{Unassigned}|\p{NChar}|\p{Cs}/;

if ( $_ =~ $regex )
{
print STDOUT;
print STDOUT "\n" if ++$c % $cols == 0;
}

}

print STDOUT "\n" if defined $c and $c % $cols != 0;
exit 0;


Good:

$ ./chars.pl '\p{Cyrillic}'
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
ѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҇ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡ
ҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱ
ӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧᴫᵸⷠⷡⷢⷣⷤⷥⷦⷧⷨⷩⷪⷫⷬⷭⷮⷯⷰⷱⷲⷳⷴⷵⷶⷷ
ⷸⷹⷺⷻⷼⷽⷾⷿꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꙮ꙯꙰꙱꙲꙳꙼꙽꙾ꙿꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏ
ꚐꚑꚒꚓꚔꚕꚖꚗ
$


Good:

$ ./chars.pl '[^\p{Word}]' | grep É
$


Bad:

$ ./chars.pl '[^\w]' | grep É
°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
$


Perl v5.14.2

Answer

Unicode support in Perl is huge topic, see e.g. this answer

For your program I managed to get [^\w] and [^\p{Word}] output same result by just adding this to start of the program:

use v5.14;

Alternatively adding following also works (enabled by use v5.14;):

use feature 'unicode_strings';