Georg Georg - 1 month ago 9

Perl: tr/// is not doing what I expect whereas s/// is

I want to remove diacritic signs in some strings.

tr/// should do the job but fails (see below). I thought I had an encoding/decoding problem, but I noticed that s/// works as I expect. Could somebody explain why?

Here is an example of results I get:

my $str1 = 'èîü';
my $str2 = $str1;
$str1 =~ tr/î/i/;
print "$str1\n"; # => i�iii�
$str2 =~ s/î/i/;
print "$str2\n"; # => èiü


Note that tr/// also modified the first and third characters of the string, not just the middle one.

Edit: I use Ubuntu 16.04 with Mate desktop environment.

Answer

When you don't have use utf8; in your script but you are viewing the code in a UTF-8 text editor, you're not seeing it the way perl sees it. You think you have a single character in the left half of your s/// and tr///, but because that character is encoded as multiple bytes, perl sees it as multiple characters.
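As a minimal sketch of that difference (not part of the original answer), the snippet below writes the two UTF-8 bytes of î as explicit escapes, which is what perl sees for a literal î in a UTF-8 source file without use utf8;. The variable names are illustrative, and the core Encode module is used to decode the bytes into one character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "î" (U+00EE) is the two bytes C3 AE.
my $bytes = "\xC3\xAE";

# Decoding turns those two bytes into a single character.
my $chars = decode('UTF-8', $bytes);

printf "as bytes: length %d\n", length $bytes;   # prints: as bytes: length 2
printf "decoded:  length %d\n", length $chars;   # prints: decoded:  length 1
```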

With s///, since none of the characters are regexp operators, you're just doing a substring search. You're searching for a multi-character substring. And you find it, because the same thing that happened in your s/// is also happening in your string literals: the characters you think are in there really aren't, but the multi-character sequence is.
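The byte-level substring match can be sketched as follows, again with the UTF-8 bytes of èîü spelled out as escapes so the example does not depend on how the file is saved. The variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $raw = "\xC3\xA8\xC3\xAE\xC3\xBC";   # UTF-8 bytes of "èîü"

# The pattern is the two-byte sequence C3 AE, matched as a substring.
(my $replaced = $raw) =~ s/\xC3\xAE/i/;

# Only the middle pair of bytes is replaced; the rest is left untouched,
# so the result is still valid UTF-8 (the bytes of "èiü").
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $replaced)), "\n";
# prints: C3 A8 69 C3 BC
```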

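In tr/// on the other hand, multiple characters aren't treated as a sequence, they're treated as a set. Each character (byte) is handled separately when it is found. Here the search list is the two bytes of î and the replacement list is the single character i; when the replacement list is shorter, its last character is repeated, so both bytes map to i. The first of those bytes also begins the encodings of è and ü, which is why the first and third characters were changed too. And that doesn't get you the results you want, because changing the individual bytes of a utf8 string is never what you want.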
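A short sketch of the set behaviour and of one possible fix, assuming the string arrives as raw UTF-8 bytes: decode it first so that tr/// operates on characters. The variable names are illustrative, and Encode is a core module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xC3\xA8\xC3\xAE\xC3\xBC";   # UTF-8 bytes of "èîü"

# Byte-level tr: the search list is the set {C3, AE}. The one-character
# replacement list is padded with its last character, so both map to "i".
(my $broken = $raw) =~ tr/\xC3\xAE/i/;
# $broken now holds the bytes 69 A8 69 69 69 BC, which is not valid UTF-8.

# Character-level tr: after decoding, the search list is the single
# character U+00EE, so only the middle character changes.
my $text = decode('UTF-8', $raw);
(my $fixed = $text) =~ tr/\x{EE}/i/;

print encode('UTF-8', $fixed), "\n";   # prints: èiü
```

In a script that contains the accented characters literally, adding use utf8; and an output layer such as binmode(STDOUT, ':encoding(UTF-8)') should have the same effect as the explicit decode and encode calls here.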

The fact that you can run simple ASCII-oriented substring search that knows nothing about utf8, and get the correct result on a utf8 string, is considered a good backward-compatibility feature of utf8, as opposed to other encodings like ucs2/utf16 or ucs4.
