Hello World - 1 month ago
Perl Question

how to decode_entities in utf8

In Perl, I am working with the following UTF-8 text:

my $string = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';


However, when I run the following:

decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');


I get

a 3.9 kΩ resistor and a 5 ÂµF capacitor


The Ω symbol has successfully decoded, but the µ symbol now has gibberish before it.

How can I use decode_entities while making sure non-encoded UTF-8 symbols (such as µ) are not converted to gibberish?

Answer

I assume you have the Encode module available (it ships with Perl) in addition to HTML::Entities, which provides decode_entities. If so, you can try this:

use Encode qw(decode);
use HTML::Entities qw(decode_entities);
my $string = "...";
$string = decode_entities(decode('utf-8', $string));

This may seem illogical. If Perl handles UTF-8 natively, why should you need to decode a UTF-8 string? The reason is that, unless told otherwise, Perl treats your string as a sequence of raw bytes. Calling decode('utf-8', ...) is how you tell Perl that those bytes are UTF-8 and should be turned into a proper character string.

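The corruption you are seeing happens when the bytes of a UTF-8 value are not recognized as belonging together, so each byte is treated as a separate character. For µ, inspecting the string shows the two bytes 0xC2 0xB5; after the above change, it ought to show the single code point 0xB5 (U+00B5). The problem only becomes visible after decode_entities because inserting Ω turns the string into a character string, and the stray 0xC2 byte is then written out as Â.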
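To check which of the two states a string is in, a small sketch like the following can help. The code_points helper is hypothetical (not part of any module), and the "\xC2\xB5" escapes stand in for a µ read as raw UTF-8 bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render each character of a string as a hex code point.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

my $bytes = "\xC2\xB5F";                # raw UTF-8 bytes for "µF"
my $chars = decode('UTF-8', $bytes);    # the same text as Perl characters

print code_points($bytes), "\n";        # U+00C2 U+00B5 U+0046
print code_points($chars), "\n";        # U+00B5 U+0046
```

If the first form shows up in your own data, the string has not been decoded yet.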

There are a ton of settings that can affect this in Perl (the use utf8 pragma, I/O layers on your filehandles, and so on). The above is most likely the right combination of changes for your given settings. Otherwise, some variation of the above (using encode instead of decode, trying decode('latin1', ...), etc.) should solve the problem.
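For reference, here is a minimal end-to-end sketch of the problem and the fix. It assumes the input arrives as raw UTF-8 bytes (written here with "\xC2\xB5" escapes so the example does not depend on the source file's encoding) and that output goes to a UTF-8 terminal; your own setup may differ:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(decode_entities);

# Assumption: the text is raw UTF-8 bytes, as read from a file or
# socket without a decoding layer. "\xC2\xB5" is the UTF-8 form of µ.
my $bytes = "a 3.9 k&Omega; resistor and a 5 \xC2\xB5F capacitor";

# Entities decoded on undecoded bytes: 0xC2 and 0xB5 stay two characters.
my $broken = decode_entities($bytes);

# Bytes decoded first, then entities: µ is a single character.
my $fixed = decode_entities(decode('UTF-8', $bytes));

# Encode characters back to UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "without decode: $broken\n";
print "with decode:    $fixed\n";
```

The first line printed contains the stray Â before µ; the second does not.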
