lemzwerg - 2 months ago
Perl Question

perl's $-[0] produces unexpected results for non-ASCII data

Consider the following input data in file y.txt (encoded in UTF-8).

bar
föbar


and a file y.pl, which puts the two input lines into an array and processes them, looking for substring start positions.

use open qw(:std :utf8);

my @array;

while (<>) {
    push @array, $_;
    print $-[0] . "\n" if /bar/;
}

# $array[0] = "bar\n", $array[1] = "föbar\n" (the lines are not chomped)
print $-[0] . "\n" if $array[1] =~ /$array[0]/u;


If I call perl y.pl < y.txt, I get

0
2
3


as the output. However, I would expect the last number to be 2 as well, but for some reason the second /.../ regexp behaves differently. What am I missing? I guess it's an encoding issue, but nothing I tried worked. This is Perl 5.18.2.

Answer

It appears to be a bug in 5.18.

$ 5.18.2t/bin/perl a.pl a
0
2
3

$ 5.20.1t/bin/perl a.pl a
0
2
2
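
One observation, offered as an assumption rather than a confirmed diagnosis: the 3 printed by 5.18.2 is what you would get if the offset were counted in bytes of the UTF-8 encoding instead of in characters, since "ö" takes two bytes. The sketch below only illustrates that arithmetic; char_offset and byte_offset are hypothetical helper names, not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helpers for illustration only.

# Offset of $needle in $text, counted in characters.
sub char_offset {
    my ($text, $needle) = @_;
    return index($text, $needle);
}

# Offset of $needle in $text, counted in bytes of the UTF-8 encoding.
sub byte_offset {
    my ($text, $needle) = @_;
    return index(encode('UTF-8', $text), encode('UTF-8', $needle));
}

my $text = "f\x{f6}bar";    # "föbar"
print char_offset($text, "bar"), "\n";    # 2
print byte_offset($text, "bar"), "\n";    # 3
```

If this reading is right, the buggy code path reports a byte position where a character position is expected, but the source of the bug itself is not established here.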

I can't find a general workaround. Adding utf8::downgrade($array[0]); or utf8::downgrade($array[0], 1); works in the case you presented, but not with the following data, or with any other data where the interpolated pattern contains characters above 255.

♠bar
f♠♠bar
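
For reference, the utf8::downgrade attempt mentioned above can be sketched roughly as follows. This is a minimal sketch, not a fix: match_start is a hypothetical helper, and the utf8::upgrade calls are an assumption about how strings read through the :utf8 layer are stored internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns the start
# offset of the first match of $pattern in $text, or undef if no match.
sub match_start {
    my ($text, $pattern) = @_;

    # Assumption: mimic strings that were read through a :utf8 layer.
    utf8::upgrade($text);
    utf8::upgrade($pattern);

    # The attempted workaround: convert the pattern back to the
    # one-byte-per-character internal format. The second argument
    # (FAIL_OK) makes the call return false instead of dying when the
    # pattern contains characters above 255, leaving it unchanged.
    utf8::downgrade($pattern, 1);

    return $text =~ /$pattern/ ? $-[0] : undef;
}

print match_start("f\x{f6}bar", "bar"), "\n";                      # "föbar" / "bar"
print match_start("f\x{2660}\x{2660}bar", "\x{2660}bar"), "\n";    # "f♠♠bar" / "♠bar"
```

On a Perl without the bug, both calls should return 2. On 5.18.2, going by the results above, only the first case would be helped, because the second pattern cannot be downgraded.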

It appears that this can only be fixed by upgrading your Perl, which is actually quite simple. (Just make sure to install it to a different directory than your system perl by following the instructions in INSTALL!)