Hello World - 10 months ago
Perl Question

How to use decode_entities on UTF-8 text

In Perl, I am working with the following UTF-8 text:

my $string = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';

However, when I run the following:

decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');

I get

a 3.9 kΩ resistor and a 5 ÂµF capacitor

The Ω symbol has been decoded successfully, but the µ symbol now has gibberish before it.

How can I use decode_entities while making sure that unencoded UTF-8 symbols (such as µ) are not converted to gibberish?

Answer Source

I am assuming you have the Encode module available (it ships with Perl and is also on CPAN), alongside HTML::Entities for decode_entities. If so, you can try this:

use Encode qw(decode);
use HTML::Entities qw(decode_entities);
my $string = "...";
$string = decode_entities(decode('utf-8', $string));

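This may seem illogical. If the text is already UTF-8, why should you need to decode it? The reason is that, unless told otherwise, Perl treats the string as a plain sequence of bytes, one character per byte, and not as UTF-8 encoded text. Calling decode('utf-8', ...) tells Perl to interpret those bytes as UTF-8 and turn them into a string of characters, which is what decode_entities expects to work on.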
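To make the order of operations concrete, here is a minimal sketch of the full round trip: bytes in, decode, decode_entities, encode on the way out. It assumes the input contained the `&Omega;` entity and arrived as raw UTF-8 bytes; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Raw UTF-8 bytes, as they might arrive from a file or socket that has no
# decoding layer. "\xC2\xB5" is the UTF-8 encoding of the micro sign (U+00B5).
# The &Omega; entity is an assumption about the original input.
my $bytes = "a 3.9 k&Omega; resistor and a 5 \xC2\xB5F capacitor";

# Step 1: interpret the bytes as UTF-8, giving a string of characters.
my $chars = decode('UTF-8', $bytes);

# Step 2: decode the entities. In scalar context decode_entities returns
# the decoded copy and leaves its argument alone.
my $decoded = decode_entities($chars);

# Step 3: encode the characters back to UTF-8 bytes before printing.
my $out = encode('UTF-8', $decoded);
print $out, "\n";
```

On a UTF-8 terminal this should print the sentence with both Ω and µ intact. Skipping step 3 and printing $decoded directly would typically trigger a "Wide character" warning.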

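The corruption you are seeing happens when the bytes of a UTF-8 value are not recognized as belonging to a single character. Undecoded, µ is the two bytes 0xC2 0xB5 (you can see them if you dump the string, for example with Data::Dumper), and Perl treats each byte as its own character. That stays invisible until decode_entities adds a character above 255, such as Ω. At that point the whole string is written out as UTF-8, so the stray 0xC2 becomes a visible "Â" in front of the µ. After the above change, µ ought to show up as the single character U+00B5 instead.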
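The difference between the undecoded and decoded string can be inspected directly. The following is a small sketch using only the Encode module; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "5 µF": the micro sign is the two bytes 0xC2 0xB5.
my $bytes = "5 \xC2\xB5F";

# Without decoding, Perl sees one character per byte.
my @undecoded = map { sprintf('U+%04X', ord($_)) } split //, $bytes;

# After decoding, the two bytes collapse into the single character U+00B5.
my $chars   = decode('UTF-8', $bytes);
my @decoded = map { sprintf('U+%04X', ord($_)) } split //, $chars;

print "undecoded: @undecoded\n";
print "decoded:   @decoded\n";

# Reproduce the corruption: add a character above 255 (here U+03A9) to the
# undecoded string and encode the result. 0xC2 is now encoded as a character
# of its own, which is the stray "Â" in front of the µ.
my $mojibake = encode('UTF-8', $bytes . "\x{3A9}");
```

The first print lists five code points, including separate U+00C2 and U+00B5 entries; the second lists four, with a single U+00B5.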

There are a ton of settings that can affect this in Perl, such as the utf8 pragma and the I/O layers on your filehandles. The above is most likely the right combination of changes for your given settings. Otherwise, some variation of the above (for example, using encode instead of decode, or 'latin1' instead of 'utf-8') should solve the problem.
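As one example of how settings change the picture, here is a sketch of a variation that needs no explicit decode call. It assumes the string is a literal in a source file saved as UTF-8, and that the input contained the `&Omega;` entity; whether this matches your setup depends on where your text actually comes from.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are read as UTF-8 characters
use HTML::Entities qw(decode_entities);

# Encode characters back to UTF-8 automatically on output.
binmode(STDOUT, ':encoding(UTF-8)');

# With "use utf8" the µ below is already one character, so decode_entities
# can be applied directly. The &Omega; entity is an assumed input.
my $string  = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';
my $decoded = decode_entities($string);

print "$decoded\n";
```

If the text instead comes from a file, the equivalent would be opening it with a `:encoding(UTF-8)` layer so that it is decoded as it is read.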