Peter V. Mørch Peter V. Mørch - 4 years ago 168
Perl Question

Perl UTF-8/encoding: How to detect $str1 eq $str2 to avoid MySQL round-trip?

A caller to a sub has given me a value

$new_value
. I have selected a value from a MySQL database into scalar
$current_value
. I cannot figure out how to reliably detect whether they're "identical". By identical I mean: If I update a database record with
$new_value
will that change the database state?

Boiling this down to its essence:

#!/usr/bin/perl -w
use utf8;
use strict;
use Encode qw(encode);
my $str = 'æøå';
my $latin1 = encode('latin1', $str);

# This in fact doesn't die. They're eq
$str eq $latin1
or die;


If I update a field in a MySQL database with
$str
, I get one value back if I re-select it - a UTF-8 encoded value. Using
$latin1
, the database field ends up with another value - a latin1/ISO-8859-1 encoded value.

The original problem I'm debugging updates a field with
CHARSET=latin1
but the symptoms show up equally well with a simple:

my $dbh = DBI->connect(
"DBI:mysql:mysql",
'user',
'pass',
# No, we don't have these options on our DB handles
# Introducing them now would causes (too) many regression issues
# for us, as in other places also, values are latin1 encoded,
# not UTF-8 encoded.
# mysql_enable_utf8 => 1,
# mysql_enable_utf8mb4 => 1`
);
my $sth = $dbh->prepare('SELECT CONCAT(?)')
or die;
$sth->execute($val);
my ($return_val) = $sth->fetchrow_array();


Since
$str
and
$latin1
result in different values after a MySQL round-trip, I'd like to detect that they are, in fact, not equal. So assuming the current value in the database is a correctly encoded latin1
æøå
that I've already
SELECT
-ed that into a
$current_value
scalar then my question boils down to coding:

sub new_value_will_change_database {
my ($current_value, $new_value) = @_;
# How to write I write this sub, so it returns true for $str
# and false for $latin1 from above?
...
}


How do I accomplish that? The only difference I've been able to detect, is that the UTF8-flag is enabled on
$str
, but not on
$latin1
. However, I also seem to remember that if I'm checking the UTF-8 flag, my code is broken...

Fuller Debugging Script



#!/usr/bin/perl -w

use utf8;
use strict;
use feature qw(:5.10);

use Encode qw(encode is_utf8);
use DBI;
use Data::Dumper;

my $str = 'æøå';

my $latin1 = encode('latin1', $str);

my $utf8upgraded = $latin1;
utf8::upgrade($utf8upgraded);

# $str, $latin1 and $utf8upgraded are all eq each other:
$str eq $latin1
or die;
$str eq $utf8upgraded
or die;
$latin1 eq $utf8upgraded
or die;

my $dbh = DBI->connect(
"DBI:mysql:mysql",
'user',
'pass',
);

my $sth = $dbh->prepare('SELECT CONCAT(?)')
or die;

sub mysql_roundtrip {
my ($val) = @_;
$sth->execute($val);
my ($concat) = $sth->fetchrow_array();
return $concat;
};

foreach my $set (
[ 'str', $str ],
[ 'latin', $latin1 ],
[ 'utf8upgraded', $utf8upgraded ],
) {
my ($disp, $val) = @$set;

my $hex = $val;
$hex =~ s/(.)/sprintf "%X", ord($1)/ge;

my $dumper = Data::Dumper->new([substr $val, 0, 1])->Terse(1)->Dump;
chomp $dumper;
printf "%-13s: val:%s mysql:%s is_utf8:%d hex:%s dumper0:%s\n",
$disp,
$val,
mysql_roundtrip($val),
is_utf8($val),
$hex,
$dumper;
}


produces this output:

str : val:��� mysql:æøå is_utf8:1 hex:E6F8E5 dumper0:"\x{e6}"
latin : val:��� mysql:��� is_utf8:0 hex:E6F8E5 dumper0:'�'
utf8upgraded : val:��� mysql:æøå is_utf8:1 hex:E6F8E5 dumper0:"\x{e6}"

Answer Source

Here is the solution we're going with (for now at least):

sub mysql_value_latin1 {
    my ($val) = @_;
    # See text - this looks strange - but works!
    if (is_utf8($val)) {
        $val = encode('utf8', $val);
    } else {
        $val = encode('latin1', $val);
    }
    return $val;
}

sub new_value_will_change_database {
    my ($current_value, $new_value) = @_;
    my $mysql_new_value = mysql_value_latin1($new_value);
    return $current_value ne $mysql_new_value;
}

Thanks @ikegami and @HelmutWollmersdorfer for your input on this issue. You both suggest these options for the $dbh:

mysql_enable_utf8 => 1,     # Decodes string received from the DB.
mysql_enable_utf8mb4 => 1,  # Sets the encoding used for the connection.

As I've pointed out that will cause an unpredictable amount of regressions in our code base since the handles are shared by many libraries.

The advantages of mysql_enable_utf8 => 1 are clear: Perl code sends correctly encoded UTF-8 data to MySQL, which then converts it to Latin1 (CP 1252) and puts it in the database. We are guaranteed that data is stored properly, and we can use UTF-8 in Perl, not caring about the Latin1-ness of the database.

There are disadvantages also: Any data that is not valid UTF-8 will be rejected by DBI or DBD::mysql (I'm not clear which), and my testing has also shown that MySQL will reject storing data in a Latin1 table that isn't valid Latin1 (CP 1252). So we need to be more explicit about encoding our data before sending it to the database - probably a good thing actually.

mysql_enable_utf8 => 0 appears to behave quite strangely. It appears, that if the UTF-8 flag on Perl scalars is set, then data will be UTF-8 encoded and otherwise data will be left at Perl's internal encoding (ISO-8859-1/Latin1). This data is then sent to MySQL and stored in the Latin1 table, wether or not the data actually is valid CP 1252 data. With mysql_enable_utf8 => 0 I was able to store all chars in 0x00-0xFF without problems, even though some are not valid CP 1252 characters.

If anybody can find a failing test for @tests, please let me know.

The task at hand in the OP was to predict whether a given scalar would change the value of the database if handed to MySQL for an UPDATE, and the sub new_value_will_change_database does exactly that - without changing properties of the $dbh. That is why I prefer this solution to the OP.

I do agree that a better technical solution is to go the mysql_enable_utf8 => 1 route, but it is also the poorer business decision because of the effort involved to address the (potential) regressions.

Fuller Debugging Script

#!/usr/bin/perl -w

use utf8;
use strict;
use feature qw(:5.10);

use Encode qw(encode is_utf8);
use DBI;
use Data::Dumper;

my $str = 'æøå';

my $latin1 = encode('latin1', $str);

my $utf8upgraded = $latin1;
utf8::upgrade($utf8upgraded);

# $str, $latin1 and $utf8upgraded are all eq each other:
$str eq $latin1
    or die;
$str eq $utf8upgraded
    or die;
$latin1 eq $utf8upgraded
    or die;

my $dbh = DBI->connect(
    "DBI:mysql:mysql",
    'user',
    'pass',
);

$dbh->do(q(
    CREATE TEMPORARY TABLE test (
        name VARCHAR(255) DEFAULT NULL
    ) CHARSET=latin1;
));
$dbh->do(q(
    INSERT INTO test (name) VALUES ('');
));

sub mysql_roundtrip_convert {
    my ($val) = @_;
    my $sth = $dbh->prepare('SELECT CONVERT(? USING LATIN1)');
    $sth->execute($val);
    my ($concat) = $sth->fetchrow_array();
    return $concat;
}

sub mysql_roundtrip_column {
    my ($val) = @_;
    my $updateSth = $dbh->prepare('update test set name=?');
    $updateSth->execute($val);
    my $getSth = $dbh->prepare('select name from test');
    $getSth->execute();
    my ($value) = $getSth->fetchrow_array();
    return $value;
};

sub mysql_roundtrip {
    my ($val) = @_;
    # Check that these two are the identical:
    my $column = mysql_roundtrip_column($val);
    my $convert = mysql_roundtrip_convert($val);
    $column eq $convert
        or die "column ne convert";
    return $column;
}

sub mysql_value_latin1 {
    my ($val) = @_;
    # See text - this looks strange - but works!
    if (is_utf8($val)) {
        $val = encode('utf8', $val);
    } else {
        $val = encode('latin1', $val);
    }
    return $val;
}

sub new_value_will_change_database {
    my ($current_value, $new_value) = @_;
    my $mysql_new_value = mysql_value_latin1($new_value);
    return $current_value ne $mysql_new_value;
}

my @tests = (
    [ 'str', $str ],
    [ 'latin', $latin1 ],
    [ 'utf8upgraded', $utf8upgraded ],
    map { [ 'char ' . $_ , 'char' . chr($_) ] } ( 0x00 .. 0xFF ),
);

foreach (@tests) {
    my ($disp, $val) = @$_;
    my $mysql_roundtrip = mysql_roundtrip($val),
    my $mysql_value_latin1 = mysql_value_latin1($val);
    $mysql_value_latin1 eq $mysql_roundtrip
        or die "mysql_value_latin1 ne mysql_roundtrip";
}
print "All test are fine\n";
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download