eflores89 - 3 months ago

UTF-8 encoding problems with R

I am trying to parse statements from the Mexican Senate's website, but I am having trouble with the UTF-8 encoding of some pages.

Some pages' HTML comes through clearly. Here is an example of a bit of one such webpage:

"CONTINÚA EL SENADOR CORRAL JURADO: Nosotros decimos. Entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora. Una decisión de pre dictamen a lo mejor lo único que va a hacer es complicar más las cosas."

As can be seen, both accents and the "ñ" come through fine.

The issue arises with some other pages of the same domain! For one of them, I get:

"-EL C. DIPUTADO ADAME ALEMÃÂN: En consecuencia está a discusión la propuesta. Y para hablar sobre este asunto, se le concede el uso de la palabra a la senadora…….."

On this second piece I've tried iconv() and coercing the encoding parameter of html() to encoding = "UTF-8", but I keep getting the same result.

I've also checked the page's encoding with the W3C Validator; it reports UTF-8 with no issues.

Using gsub does not seem practical, because different characters are downloaded with the same garbled "code":

í - ÃÂ
á - ÃÂ
ó - ÃÂ
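This "ÃÂ" pattern is classic double encoding: the page's UTF-8 bytes get reinterpreted as latin1 and re-encoded, once per layer of garbling. A minimal sketch reproducing one round of it (the string here is my own toy example, not taken from the page):

```r
# Reproduce one round of mojibake: take a correctly encoded "á",
# then mislabel its UTF-8 bytes (c3 a1) as latin1 and convert.
x <- "\u00e1"                    # "á"
bytes <- charToRaw(enc2utf8(x))  # UTF-8 bytes: c3 a1
garbled <- rawToChar(bytes)
Encoding(garbled) <- "latin1"    # declare the UTF-8 bytes as latin1...
mojibake <- enc2utf8(garbled)    # ...and re-encode: "Ã¡"
mojibake
```

Each extra round wraps another "Ã"-like layer around the character, which is consistent with the "ÃÂ" strings above.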

Pretty much fresh out of ideas.

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] grDevices utils datasets graphics stats grid methods base

other attached packages:
[1] stringi_0.4-1 magrittr_1.5 selectr_0.2-3 rvest_0.2.0 ggplot2_1.0.0 geosphere_1.3-11 fields_7.1
[8] maps_2.3-9 spam_1.0-1 sp_1.0-17 SOAR_0.99-11 data.table_1.9.4 reshape2_1.4.1 xlsx_0.5.7
[15] xlsxjars_0.6.1 rJava_0.9-6

loaded via a namespace (and not attached):
[1] bitops_1.0-6 chron_2.3-45 colorspace_1.2-4 digest_0.6.8 evaluate_0.5.5 formatR_1.0 gtable_0.1.2
[8] httr_0.6.1 knitr_1.8 lattice_0.20-29 MASS_7.3-35 munsell_0.4.2 plotly_0.5.17 plyr_1.8.1
[15] proto_0.3-10 Rcpp_0.11.3 RCurl_1.95-4.5 RJSONIO_1.3-0 scales_0.2.4 stringr_0.6.2 tools_3.1.2
[22] XML_3.98-1.1

This seems to be the issue:

[1] "ASCII" "latin1" "latin1" "ASCII" "ASCII" "latin1" "ASCII" "ASCII" "latin1"

... and so forth. Clearly, the problem is with the strings marked latin1:


How can I coerce the latin1 strings to correct UTF-8? When "translated" by stringi, it appears to do the conversion wrong, giving me the issues described earlier.
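The misbehavior can be reproduced on a toy string: a value whose bytes are already valid UTF-8 but whose mark says latin1 gets double-encoded by any latin1-to-UTF-8 conversion, whereas re-declaring the mark leaves the bytes untouched. A sketch with a made-up string, not the scraped data:

```r
# "CONTINÚA" as raw UTF-8 bytes (Ú = c3 9a), wrongly marked latin1:
x <- rawToChar(as.raw(c(0x43, 0x4f, 0x4e, 0x54, 0x49, 0x4e, 0xc3, 0x9a, 0x41)))
Encoding(x) <- "latin1"

# Converting "from latin1" re-encodes each byte and yields mojibake:
bad <- iconv(x, from = "latin1", to = "UTF-8")

# Re-declaring the encoding fixes the label without touching the bytes:
Encoding(x) <- "UTF-8"
x   # "CONTINÚA"
```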


Encodings are one of the 21st century's worst headaches. But here's a solution for you:

# Set-up remote reading connection, specifying UTF-8 as encoding.
addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html"
read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt")

# Read in cycles of 1000 characters
html.text <- c()
i <- 0
while (length(html.text) == i) {
    html.text <- append(html.text, readChar(con = read.html.con, nchars = 1000))
    cat(i <- i + 1)
}

# Close the reading connection
close(read.html.con)

# Paste everything back together and, at the same time, convert from UTF-8
# to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly
# meant to drive us insane.
content <- paste0(iconv(html.text, from = "UTF-8", to = "UTF-8"), collapse = "")

# Set-up local writing
outpath <- "~/htmlfile.html"

# Create file connection specifying "UTF-8" as encoding, once more
# (Although this one makes sense)
write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8")

# Use capture.output to dump everything back into the html file
# Using cat inside it will prevent having [1]'s, quotes and such parasites
capture.output(cat(content), file = write.html.con)

# Close the output connection
close(write.html.con)

Then you're ready to open your newly created file in your favorite browser. You should see it intact and have it ready to be reopened with the tools of your choosing!
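As an aside: if the double-encoded strings are already sitting in memory, one round of mojibake can usually be reversed in place by converting the garbage from UTF-8 back to latin1 (which recovers the original bytes) and then re-declaring those bytes as UTF-8. `fix_mojibake` is a name I made up for this sketch; it only works when every garbled code point exists in latin1:

```r
# Reverse one round of double encoding (a sketch; assumes the mojibake
# consists only of code points that exist in latin1).
fix_mojibake <- function(s) {
  b <- iconv(s, from = "UTF-8", to = "latin1")  # "Ã¡" -> bytes c3 a1
  Encoding(b) <- "UTF-8"                        # reread those bytes as UTF-8
  b
}

fix_mojibake("\u00c3\u00a1")  # "Ã¡" -> "á"
```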