Håkon Hægland - 6 months ago
Perl Question

Using Term::ReadLine with Unicode input

I am trying to figure out how to read Unicode input from the terminal using Term::ReadLine. It turns out that if I enter a Unicode character at the prompt, the returned string varies depending on various settings. (I am running Ubuntu 14.10 and have installed Term::ReadLine::Gnu.) For example (p.pl):

use open qw( :std :utf8 );
use strict;
use warnings;

use Devel::Peek;
use Term::ReadLine;

my $term = Term::ReadLine->new('ProgramName');
$term->ornaments( 0 );
my $ans = $term->readline("Enter message: ");
Dump ( $ans );


Running p.pl and typing å at the prompt gives the output:

Enter message: å
SV = PV(0x83a5a0) at 0x87c080
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x917500 "\303\245"\0
CUR = 2
LEN = 10


So the returned string $ans does not have the UTF-8 flag set. However, if I run the program using perl -CS p.pl, the output is:

Enter message: å
SV = PVMG(0x24c12e0) at 0x23050a0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x248faf0 "\303\245"\0 [UTF8 "\x{e5}"]
CUR = 2
LEN = 10


Now the UTF-8 flag is correctly set on $ans. So the first question is: why does the command line option -CS behave differently from the pragma use open qw( :std :utf8 )?

Next, I tested Term::ReadLine::Stub with the -CS option:

$ PERL_RL=Stub perl -CS p.pl


the output is now:

Enter message: å
SV = PV(0xf97260) at 0xfd90c8
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x10746e0 "\303\203\302\245"\0 [UTF8 "\x{c3}\x{a5}"]
CUR = 4
LEN = 10


The returned string $ans has now been doubly encoded, so the output is corrupted. Is this a bug, or is it expected behavior?

Answer

Term::ReadLine does not read from STDIN; it opens a new filehandle. As a result, use open qw(:std :utf8); has no effect on it.

You need to do something like this:

my $term = Term::ReadLine->new('name');
binmode($term->IN, ':utf8');
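The effect of that binmode call can be sketched without a terminal. In the example below an in-memory filehandle stands in for $term->IN (that substitution is an assumption made purely for illustration; the variable names are hypothetical), and the two bytes a UTF-8 terminal would send for å are read back through the :utf8 layer:

```perl
use strict;
use warnings;

# The bytes a UTF-8 terminal sends for "å", followed by a newline.
my $raw = "\xC3\xA5\n";

# An in-memory filehandle plays the role of $term->IN here (illustration only).
open my $in, '<', \$raw or die "Cannot open in-memory handle: $!";

# Same call as in the answer: mark the handle's input as UTF-8.
binmode($in, ':utf8');

my $line = <$in>;
chomp $line;

# Report how many characters were read and whether the UTF-8 flag is set.
printf "%d character(s), U+%04X, UTF-8 flag %s\n",
    length($line), ord($line), utf8::is_utf8($line) ? 'on' : 'off';
```

With the layer in place the two bytes should come back as a single character, which is the state the question was trying to reach for $ans.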

Update about -CS:

The -C option sets the magic variable ${^UNICODE}. Both -CS and -CI set its 0x0001 bit, so the expression ${^UNICODE} & 0x0001 is true. When that expression is true, Term::ReadLine turns on the UTF-8 flag for the input string.
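A quick way to see which bits a given -C option sets is to print ${^UNICODE} directly. This is only a small diagnostic sketch (the variable name $stdin_bit is made up for the example); run it once as plain perl and once with perl -CS or perl -CI to compare:

```perl
use strict;
use warnings;

# Bit 0x0001 of ${^UNICODE} corresponds to the "I" (STDIN) letter of -C.
# -CS is shorthand for -CIOE, so it sets this bit too.
my $stdin_bit = (${^UNICODE} & 0x0001) ? 1 : 0;

printf "\${^UNICODE} = %d, STDIN bit is %s\n",
    ${^UNICODE}, $stdin_bit ? 'set' : 'not set';
```

Note that the PERL_UNICODE environment variable also feeds into ${^UNICODE}, so the value without any -C option is not necessarily 0 on every system.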

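Note that the -CS option is different from binmode($term->IN, ':utf8'). The former only causes the UTF-8 flag to be set on the string after it has been read, while the latter adds an I/O layer that decodes the input as it is read from the filehandle.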
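The difference between decoding bytes and merely changing the internal representation can be seen with the core utf8:: functions. The sketch below is not taken from Term::ReadLine itself; it only shows that treating each input byte as a separate Latin-1 character (utf8::upgrade) and then encoding again yields the same four bytes as the Stub dump in the question. Whether the Stub backend does exactly this internally is an assumption, not something confirmed in this thread.

```perl
use strict;
use warnings;

# Raw bytes a UTF-8 terminal sends for "å" (U+00E5).
my $bytes = "\xC3\xA5";

# Decoding: interpret the two bytes as one UTF-8 encoded character.
my $decoded = $bytes;
utf8::decode($decoded);
printf "decoded:  %d character(s), first is U+%04X\n",
    length($decoded), ord($decoded);
# prints "decoded:  1 character(s), first is U+00E5"

# Upgrading: keep each byte as its own character (U+00C3, U+00A5)
# and only switch the internal storage to UTF-8.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
printf "upgraded: %d character(s)\n", length($upgraded);
# prints "upgraded: 2 character(s)"

# Encoding the upgraded string for output doubles the encoding.
my $reencoded = $upgraded;
utf8::encode($reencoded);
printf "re-encoded bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $reencoded;
# prints "re-encoded bytes: C3 83 C2 A5"
```

The final byte sequence C3 83 C2 A5 is "\303\203\302\245" in octal, which matches the PV shown for the Stub run.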
