David Ljung Madison - 4 months ago
Perl Question

Perl binmode of UTF-8 only works with \x{codepoint} and not \x byte encoding for a three-byte character

The Euro character is encoded as the three bytes 0xE2 0x82 0xAC in UTF-8.

I'm trying to use a string in perl with the UTF-8 character output to STDOUT.

So I set my script to be in UTF-8 with 'use utf8;'

And I set up my STDOUT to be in UTF-8 with 'binmode'.

An example script is:

use utf8;
binmode STDOUT, ':utf8';
print "I owe you 160\x{20ac}\n";
print "I owe you 80\xe2\x82\xac\n"; # UTF-8 encoding?


The \x{codepoint} form works fine, but the \x byte encoding gives me garbage output:

I owe you 160€
I owe you 80â¬

Answer

If you want a string that consists of the three bytes E2 82 AC, you can declare it like this:

my $bytes = "\xE2\x82\xAC";

The \xXX form in a double-quoted string uses at most two hex digits to represent a single byte (a character with a codepoint in the range 0-255).

The string above contains 3 bytes. If we pass the string to the length function it will return 3:

say 'Length of $bytes is: ' . length($bytes);    # 3

Perl has no way of knowing whether those three bytes are intended to represent the Euro symbol. They could equally be a three byte sequence from inside a JPEG file, or a ZIP file, or an SSL-encoded TCP data stream traversing a network. Perl doesn't know or care - it's just three bytes.

If you actually want a string of characters (rather than bytes) then you need to provide the character data in a way that allows Perl to use its internal representation of Unicode characters to store them in memory. One way is to include the non-ASCII characters directly in the source code, saved as UTF-8. If you're doing this you'll need to say use utf8 at the top of your script to tell the Perl interpreter to treat non-ASCII string literals as UTF-8:

use utf8;

my $euro_1 = "€";

Alternatively you can use the form \x{X...} with one to six hex digits representing the Unicode codepoint number. This declares an identical string:

my $euro_2 = "\x{20ac}";

Each of these strings contains a multi-byte representation of the euro character in Perl's internal encoding. Perl knows the strings are character strings so the length function will return 1 (for 1 character) in each case:

say 'Length of $euro_1 is: ' . length($euro_1);    # 1
say 'Length of $euro_2 is: ' . length($euro_2);    # 1
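
Since both forms declare the same one-character string, a quick equality check (assuming the $euro_1 and $euro_2 declarations above are in scope) confirms it:

say $euro_1 eq $euro_2 ? 'same string' : 'different strings';    # same string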

The defining feature of Perl's internal representation of character strings is that it is for use inside Perl. If you want to write the data out to a file or a socket, you'll need to encode the character string to a sequence of bytes:

use Encode qw(encode);

say encode('UTF-8', $euro_1);
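
The same idea applies when writing to a file or a socket. A minimal sketch of writing the encoded bytes to a file (euro.txt is just an example filename):

open my $out, '>:raw', 'euro.txt' or die "Cannot open euro.txt: $!";   # example filename
print {$out} encode('UTF-8', $euro_1);                                 # writes the bytes E2 82 AC
close $out;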

It's also possible to use binmode, or an IO layer argument to open, to say that any string written to a particular filehandle should automatically be encoded using a specific encoding.

binmode(STDOUT, ':encoding(utf-8)');

say $euro_1;
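
The open-argument variant mentioned above would look something like this (again, euro.txt is only an example filename):

open my $out, '>:encoding(UTF-8)', 'euro.txt' or die "Cannot open euro.txt: $!";   # example filename
say {$out} $euro_1;    # the IO layer encodes the characters on the way out
close $out;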

This will only work correctly for character strings. If we took our original 3-byte string $bytes and used either encode or the IO layer, we would end up with garbage, because Perl would treat each byte as a separate character and encode each one to UTF-8 individually. So \xE2 would be output as \xC3\xA2, \x82 would be output as \xC2\x82 and so on.
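
You can see that double encoding for yourself with something like this (a sketch; the six bytes in the comment are what encode produces when handed those byte-valued characters):

use Encode qw(encode);

my $bytes   = "\xE2\x82\xAC";
my $mangled = encode('UTF-8', $bytes);
say join ' ', map { sprintf '%02X', ord } split //, $mangled;    # C3 A2 C2 82 C2 AC - six bytes, not three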

However, we can use the decode function from the Encode module to convert the 3-byte $bytes string into a single-character string in Perl's internal character representation:

use Encode qw(decode);

my $bytes = "\xE2\x82\xAC";
my $euro_3 = decode('UTF-8', $bytes);

say 'Length of $euro_3 is ' . length($euro_3);    # 1
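
Putting the pieces together, here is a sketch of one way your original script could handle both forms correctly (it is not the only way to write it):

use utf8;
use feature 'say';
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

say "I owe you 160\x{20ac}";                           # already a character string
say "I owe you 80", decode('UTF-8', "\xe2\x82\xac");   # decode the bytes to characters first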

One minor nitpick: in your original question you stated that 20AC is the UTF-16 representation of the euro symbol. In fact there are two different UTF-16 representations: UTF-16BE and UTF-16LE, with the latter using the opposite byte order: AC20.
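
If you want to see that difference, encode can produce both byte orders (a quick sketch reusing the euro codepoint from above):

use Encode qw(encode);

say join ' ', map { sprintf '%02X', ord } split //, encode('UTF-16BE', "\x{20ac}");    # 20 AC
say join ' ', map { sprintf '%02X', ord } split //, encode('UTF-16LE', "\x{20ac}");    # AC 20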
