Question

data.table, logical comparison and encoding bugs/errors in a non-English environment

data.table gives a warning even when the encodings are not mixed and are known. The only time a merge doesn't give any warning is when the encoding is set to unknown on both of them. This doesn't seem right; logical comparison seems to act differently and to ignore encoding.

I have two questions. First, why does data.table behave this way when both encodings are known and the same? Judging by the warning, I guess it's a bug (albeit a small one)?

The last merge, which fails, is perhaps desired behavior, but shouldn't the logical comparison then fail as well? That brings me to the second question: what is the difference between a data.table join and a logical comparison, given that in my last merge they give different results?

Logical comparisons seem more robust in the face of encoding issues.
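If I understand base R's comparison rules correctly, == translates strings whose declared encodings differ into a common encoding before comparing them, whereas data.table 1.9.6 compares the raw bytes (as its own warning text says). A minimal base-R sketch of that difference:

x <- iconv("ÅÄÖ", to = "latin1")        # same characters, marked latin1
y <- enc2utf8(x)                        # re-encoded and marked UTF-8
x == y                                  # TRUE: == translates before comparing
identical(charToRaw(x), charToRaw(y))   # FALSE: the underlying bytes differ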

Code and reproducible output below; sessionInfo() output is shown below that.

library("data.table")

d.tst <- data.table(Nr = c("ÅÄÖ", "ÄÖR"))
d.tst2 <- data.table(Nr2 = c("ÅÄÖ", "ÄÖR"),
                     Dat = c(1, 2))

Encoding(d.tst$Nr)
# [1] "latin1" "latin1"
Encoding(d.tst2$Nr2)
# [1] "latin1" "latin1"

d.tst[1]$Nr == d.tst2[1]$Nr2
# [1] TRUE
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message:
In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  :
  A known encoding (latin1 or UTF-8) was detected in a join column.
  data.table compares the bytes currently, so doesn't support mixed
  encodings well; i.e., using both latin1 and UTF-8, or if any unknown
  encodings are non-ascii and some of those are marked known and others
  not. But if either latin1 or UTF-8 is used exclusively, and all unknown
  encodings are ascii, then the result should be ok. In future we will
  check for you and avoid this warning if everything is ok. The tricky
  part is doing this without impacting performance for ascii-only cases.


d.tst$Nr <- iconv(d.tst$Nr, "LATIN1", "UTF-8")
d.tst2$Nr2 <- iconv(d.tst2$Nr2, "LATIN1", "UTF-8")

Encoding(d.tst$Nr)
# [1] "UTF-8" "UTF-8"
Encoding(d.tst2$Nr2)
# [1] "UTF-8" "UTF-8"

a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message:
In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  :
  A known encoding (latin1 or UTF-8) was detected in a join column.
  data.table compares the bytes currently, so doesn't support mixed
  encodings well; i.e., using both latin1 and UTF-8, or if any unknown
  encodings are non-ascii and some of those are marked known and others
  not. But if either latin1 or UTF-8 is used exclusively, and all unknown
  encodings are ascii, then the result should be ok. In future we will
  check for you and avoid this warning if everything is ok. The tricky
  part is doing this without impacting performance for ascii-only cases.


d.tst$Nr <- iconv(d.tst$Nr, "UTF-8", "cp1252")
d.tst2$Nr2 <- iconv(d.tst2$Nr2, "UTF-8", "cp1252")

Encoding(d.tst$Nr)
# [1] "unknown" "unknown"
Encoding(d.tst2$Nr2)
# [1] "unknown" "unknown"

a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

# Here we change the encoding on only one data.table

d.tst$Nr <- iconv(d.tst$Nr, "cp1252", "UTF-8")

# Check encoding
Encoding(d.tst$Nr)
# [1] "UTF-8" "UTF-8"
Encoding(d.tst2$Nr2)
# [1] "unknown" "unknown"

# Logical comparison
d.tst[1]$Nr == d.tst2[1]$Nr2
# [1] TRUE

# This merge fails to match any rows (not just a warning), even though the
# logical comparison above says the values are equal
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

Warning message:
In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  :
  A known encoding (latin1 or UTF-8) was detected in a join column.
  data.table compares the bytes currently, so doesn't support mixed
  encodings well; i.e., using both latin1 and UTF-8, or if any unknown
  encodings are non-ascii and some of those are marked known and others
  not. But if either latin1 or UTF-8 is used exclusively, and all unknown
  encodings are ascii, then the result should be ok. In future we will
  check for you and avoid this warning if everything is ok. The tricky
  part is doing this without impacting performance for ascii-only cases.
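
For completeness, a sketch of a workaround I would try for this last case (my own suggestion, not from the data.table docs): re-mark the unmarked cp1252 column as UTF-8 so both join columns carry the same bytes and encoding marks. The warning above may still appear in 1.9.6, but the rows should then match:

d.tst2$Nr2 <- iconv(d.tst2$Nr2, "cp1252", "UTF-8")  # both columns now marked UTF-8
Encoding(d.tst$Nr); Encoding(d.tst2$Nr2)            # both should report "UTF-8"
a <- merge(d.tst, d.tst2, all.x = TRUE, by.x = "Nr", by.y = "Nr2")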


sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252 LC_NUMERIC=C
[5] LC_TIME=Swedish_Sweden.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.6 RODBC_1.3-13

loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.2 assertthat_0.1 DBI_0.4-1 tools_3.3.1 tibble_1.1 Rcpp_0.12.5 chron_2.3-47

Answer

As of the new data.table version 1.9.8 this should be fixed. I ran the same script (as above) in both 1.9.6 and 1.9.8; it failed in the former but worked beautifully in the latter. So this should be solved now.
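A quick check before re-running the script (a minimal sketch; the install line is only needed if the reported version is older than 1.9.8):

packageVersion("data.table")       # should report 1.9.8 or newer
# install.packages("data.table")   # upgrade from CRAN if it is older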