Nathann Cohen - 1 month ago
Perl Question

Perl regex replacement of logical unicode characters

Here is a simple substitution that adds parentheses around upper-case characters in a Unicode string. As you can see, the result is rather ugly:

~$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5

My understanding is that the regex operates on "code points" instead of "logical characters", which splits my 'é' into meaningless characters. Is there a way to force the regex to work on whole logical Unicode characters?



Assuming that your terminal uses UTF-8 encoding, the command

$ echo -n "é" | perl -ne 'printf "%vX\n", $_'

prints

C3.A9

so the input to the Perl program has not been decoded into characters: it is still a string of UTF-8 bytes.

To decode the input into a Perl character string, add a UTF-8 layer on the standard input stream using the -CI option:

$ echo -n "é" | perl -CI -ne 'printf "%vX\n", $_'

The output is now

E9

However, if you also try to print the character back to standard output, you will not get é but a Unicode replacement character from the terminal. This is because Perl writes the character U+00E9 as the single byte 0xE9 when the output stream has no encoding layer, but the terminal expects UTF-8, and a lone 0xE9 byte is not valid UTF-8:

$ echo -n "é" | perl -CI -nE 'printf "%s: %vX\n", $_, $_'
�: E9

To get correct output, you can also add a UTF-8 encoding layer on the standard output stream (using the -CO flag):

$ echo -n "é" | perl -CIO -nE 'printf "%s: %vX\n", $_, $_'
é: E9

According to perlunicode:

"Upper" is a synonym for "Uppercase", and we could have written \p{Uppercase} equivalently as \p{Upper}

For instance, \p{Uppercase} matches any single character with the Unicode "Uppercase" property

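If you use \p{Upper} on a byte string, you will not get any warnings from Perl. Each byte is simply treated as the code point with the same number, so the bytes in the range 0xC0 to 0xDE (except 0xD7, the multiplication sign) match the uppercase property, because U+00C0 to U+00DE are the Latin-1 uppercase letters À to Þ. Try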

perl -E 'for $i (0x80..0xFF) {$_=chr $i; printf "%x\n", $i if /\p{Upper}/}'

This explains the output you got:

$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5

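Here, the letter é is represented as 2 bytes (in UTF-8), 0xC3 and 0xA9. The byte 0xC3, read as the code point U+00C3 (Ã), matches the Unicode Upper property, so the parentheses are inserted around it and the two-byte sequence is split apart.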

A solution to your problem is therefore to add UTF-8 encoding layers on the standard input and output. You can combine -CI and -CO using -CS, which also covers standard error:

$ echo "Whatéver 5" | perl -CS -pe 's/(\p{Upper})/($1)/g'

with output:

(W)hatéver 5

For more information on Unicode handling in Perl, see "Command line switches for one-liners" in the SE Documentation for Perl.