Artalus - 4 months ago
Perl Question

Perl: regex won't work without parentheses

I am writing a simple Perl script that checks a string for different word forms (in English and Russian) of a nickname. I would like to use the following regex:

/(gunn?er|gunn?|ганн?еру?|ганн?у?)/i
This is valid according to the regex101.com tester and Notepad++. However, on my computer Perl does not match with this regex unless I put additional parentheses around every ? and | operand:
/((gun(n)?er)|(gun(n)?)|(ган(н)?ер(у)?)|(ган(н)?(у)?))/i
A friend I asked about this couldn't reproduce the behavior. Is there some setting of the script or of the Perl interpreter itself that I should change?

Edit: As requested, the code of my tests:

#!/usr/bin/perl
my $GUN = "gunner";
my $HZ = "!!!";

sub GetNickFromMsg
{
    my ($msg) = @_;
    if ( $msg =~ /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i )
    {
        return $GUN;
    }
    return $HZ;
}

my @nicks = ("Gunner", "guner", "ганнер", "ганеру", "гану");
foreach $n (@nicks)
{
    my $res = GetNickFromMsg($n);
    print "$n -> $res\n";
}


The output I get:

Gunner -> !!!
guner -> !!!
ганнер -> !!!
ганеру -> !!!
гану -> !!!


If I change the regex to the second version, with parentheses everywhere, every word form produces "-> gunner", as it should. I've tried adding use feature 'unicode_strings' to the beginning of the script and using the ui modifier instead of i, as Casimir suggested, but it didn't help.

I run the script on a Linux server:
Linux version 4.3.0-1-amd64 (debian-kernel@lists.debian.org) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP Debian 4.3.3-5 (2016-01-04)
with Perl version 5.22.1.

Answer

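You need to add use utf8 at the top of your program to tell Perl that the source code itself is UTF-8 encoded. Without it, the Cyrillic letters in your regex and string literals are treated as raw bytes rather than as characters.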
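
To see what the pragma changes, it can help to compare the same word counted as characters and as UTF-8 bytes. The following is only an illustrative sketch; the helper name char_and_byte_lengths is made up for this example and is not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # decode the literals in this file as characters
use Encode qw(encode);

binmode(STDOUT, ':encoding(UTF-8)');       # so the Cyrillic text prints cleanly

# Hypothetical helper: returns (number of characters, number of UTF-8 bytes).
sub char_and_byte_lengths {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);   # the raw bytes Perl sees without "use utf8"
    return (length $text, length $octets);
}

my ($chars, $bytes) = char_and_byte_lengths('ганнер');
print "ганнер: $chars characters, $bytes bytes\n";
```

Each Cyrillic letter here takes two bytes in UTF-8, so without use utf8 a quantifier such as н? applies only to the second byte of н, not to the whole letter.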

You will also need to set STDOUT to handle UTF-8 encoding; otherwise you will get "Wide character in print" warnings.
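
The warning appears because a string containing code points above 255 is printed to a handle that has no encoding layer. As a rough illustration, the sketch below prints through a UTF-8 layer into an in-memory buffer so the resulting bytes can be inspected; the helper name bytes_written is hypothetical. For STDOUT alone, calling binmode(STDOUT, ':encoding(UTF-8)') is a per-handle alternative.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: returns the octets produced by printing $text
# through an :encoding(UTF-8) layer.
sub bytes_written {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $octets = bytes_written("гану\n");
printf "%d bytes written\n", length $octets;   # 4 two-byte letters plus the newline
```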

Here's an edited version of your program that works correctly and produces the output you expected:

#!/usr/bin/perl

use utf8;
use strict;
use warnings 'all';

use open qw/ :std :encoding(UTF-8) /;

my $GUN = 'gunner';
my $HZ  = '!!!';

sub GetNickFromMsg {
    my ($msg) = @_;

    if ( $msg =~ /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i ) {
        return $GUN;
    }

    return $HZ;
}

my @nicks = qw/ Gunner guner ганнер ганеру гану /;

foreach my $n (@nicks) {
    my $res = GetNickFromMsg($n);
    print "$n -> $res\n";
}

Output:

Gunner -> gunner
guner -> gunner
ганнер -> gunner
ганеру -> gunner
гану -> gunner
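
Note that use utf8 only covers literals in the source file. If the messages eventually come from outside the program (for example a socket or command-line arguments), they arrive as bytes and need decoding before the match. A minimal sketch, assuming the input is UTF-8 encoded; the helper name nick_in_raw_msg is hypothetical:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 octets and returns 1 if any
# form of the nickname is found, 0 otherwise.
sub nick_in_raw_msg {
    my ($octets) = @_;
    my $msg = decode('UTF-8', $octets);    # bytes -> characters
    return $msg =~ /gunn?er|gunn?|ганн?еру?|ганн?у?/i ? 1 : 0;
}

# Simulate bytes arriving from outside the program.
my $raw = encode('UTF-8', 'привет, ГАННЕР');
print nick_in_raw_msg($raw) ? "match\n" : "no match\n";
```

Once both the pattern and the message are character strings, the i modifier also folds case for the Cyrillic letters.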