René Nyffenegger - 4 months ago
Perl Question

Are there any gotchas with open(my $f, '<:encoding(UTF-8)', $n)?

I am having a problem that I am unable to reproduce in a form suitable for Stack Overflow, although it is reproducible in my production environment.

The problem occurs in a Perl script that, among other things, iterates over a file that looks like this:

abc-4-9|free text, possibly containing non-ascii characters|
cde-3-8|hällo wörld|
# comment

xyz-9-1|and so on|
qrs-2-8|and so forth|


I can verify the correctness of the file with this Perl script:

use warnings;
use strict;

open (my $f, '<:encoding(UTF-8)', 'c:\path\to\file') or die "$!";

while (my $s = <$f>) {
    chomp($s);
    next unless $s;
    next if $s =~ m/^#/;
    $s =~ m!(\w+)-(\d+)-(\d+)\|([^|]*)\|! or die "\n>$s<\n didn't match on line $.";
}

print "Ok\n";
close $f;


When I run this script, it does not die on line 10 and consequently prints Ok.

Now, I use essentially the same construct in a huge Perl script (hence irreproducible for Stack Overflow), and there it dies on line 2199 of the input file.

If I change the first line (which is completely unrelated to line 2199) from something like

www-1-1|A line with some words|


to

www-1-1|x|


the script will process line 2199 (but fail later).

Interestingly, this behaviour was introduced when I changed

open (my $f, '<', 'c:\path\to\file') or die "$!";


to

open (my $f, '<:encoding(UTF-8)', 'c:\path\to\file') or die "$!";


Without the :encoding(UTF-8) directive, the script does not fail. Of course, I need the encoding directive since the file contains non-ASCII characters.
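
To illustrate why the layer matters for non-ASCII text, here is a minimal sketch that reads the same bytes once as raw bytes and once through :encoding(UTF-8). The helper first_line_length is a hypothetical name introduced only for this example, and an in-memory file stands in for the real input file.

```perl
use strict;
use warnings;

# UTF-8 bytes for "hällo wörld\n" (ä = C3 A4, ö = C3 B6), spelled out
# so the example does not depend on the encoding of this source file.
my $bytes = "h\xC3\xA4llo w\xC3\xB6rld\n";

# Hypothetical helper: read the first line of the in-memory file through
# the given open mode and return its length after chomp.
sub first_line_length {
    my ($mode) = @_;
    open(my $fh, $mode, \$bytes) or die "open failed: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return length $line;
}

my $raw_len     = first_line_length('<');                  # length in bytes
my $decoded_len = first_line_length('<:encoding(UTF-8)');  # length in characters

print "without layer: $raw_len, with layer: $decoded_len\n";
```

Without the layer, each of ä and ö arrives as two separate bytes, so the string is two units longer than the decoded text.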

BTW, the same script runs without problems on Linux.

On Windows, where it fails, I use Strawberry Perl 5.24.

Answer

I do not have a full and correct explanation of why this is necessary, but you can try opening the file with

'<:unix:encoding(UTF-8)'
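
As a rough sketch of how the loop from the question might look with this layer stack: the sample file below is written to a temporary path purely for illustration. One assumption to check in your own environment: since :crlf is part of the default layer stack on Windows and is not named here, carriage returns may reach the script, so the sketch strips "\r\n" explicitly instead of relying on chomp.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file as raw bytes with CRLF line endings,
# roughly what a Windows editor would produce (illustrative data only).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "cde-3-8|h\xC3\xA4llo w\xC3\xB6rld|\r\n",
             "# comment\r\n",
             "\r\n",
             "xyz-9-1|and so on|\r\n";
close $out or die "close failed: $!";

open(my $f, '<:unix:encoding(UTF-8)', $path) or die "$!";

my @texts;
while (my $s = <$f>) {
    # Assumption: without the crlf layer a trailing "\r" may survive,
    # so remove either line-ending style here.
    $s =~ s/\r?\n\z//;
    next unless length $s;
    next if $s =~ m/^#/;
    $s =~ m!(\w+)-(\d+)-(\d+)\|([^|]*)\|! or die "\n>$s<\n didn't match on line $.";
    push @texts, $4;    # keep the decoded free-text field
}
close $f;

print scalar(@texts), " records parsed\n";
```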

This may be related to my question "Why is CRLF set for the unix layer on Windows?", which I ran into while investigating a different problem that I never managed to resolve.
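
To compare what each open mode actually puts on the handle, the built-in PerlIO::get_layers can list the layer stack. The following is only a diagnostic sketch: layers_for is a hypothetical helper, the temporary file is a stand-in for the real input, and the output differs between platforms, so none is shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A throwaway file to open; its content is irrelevant for this check.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "abc-4-9|x|\n";
close $tmp or die "close failed: $!";

# Hypothetical helper: open $path with the given mode and return the
# names of the PerlIO layers on the resulting handle.
sub layers_for {
    my ($mode) = @_;
    open(my $fh, $mode, $path) or die "open failed: $!";
    my @layers = PerlIO::get_layers($fh);
    close $fh;
    return @layers;
}

for my $mode ('<', '<:encoding(UTF-8)', '<:unix:encoding(UTF-8)') {
    printf "%-26s => %s\n", $mode, join(' ', layers_for($mode));
}
```

Running this on both the Windows and the Linux machine should show whether a crlf layer is present in one stack and absent in the other.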
