SY Kim - 1 month ago
R Question

URL encoding in R - giving different result?

I'm doing web scraping on South Korean newspaper websites, but I'm having trouble with URL encoding. The original keyword was "실업률" (unemployment rate), and I first tried the [URLencode] and [curlEscape] functions (i.e. url_key <- URLencode("실업률")). Both gave me the same result:

"%BD%C7%BE%F7%B7%FC"

But this did not work properly during scraping. On the other hand, using a URL encoding site (http://meyerweb.com/eric/tools/dencoder/), I got

"%EC%8B%A4%EC%97%85%EB%A5%A0"

and it worked well.

But I still don't know what caused the different output, or how to get the latter output in R. Thanks in advance for any responses.

(In response to comments, I have added my sessionInfo() output below)

R version 3.2.4 (2016-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949
[3] LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C
[5] LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] RCurl_1.95-4.8 bitops_1.0-6 plyr_1.8.4 stringr_1.1.0
[5] XML_3.98-1.4

loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.4 Rcpp_0.12.7 stringi_1.1.2

Answer

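Taking 실 (U+C2E4), we see that its UTF-8 encoding is 0xEC 0x8B 0xA4 (3 bytes). That matches the start of the expected URL encoding. Your incorrect result appears to come from another character set (EUC-KR, or its superset CP949?). Your sessionInfo() shows the locale Korean_Korea.949, so the string is presumably held in that native encoding, and the bytes being percent-encoded are the CP949 bytes rather than the UTF-8 ones.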
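If that diagnosis is right, converting the keyword to UTF-8 before percent-encoding should give the second form. The following is a minimal sketch rather than a tested fix for your exact setup: the variable names are illustrative, the keyword is written with Unicode escapes so the script does not depend on the source file's encoding, and it assumes your platform's iconv recognises the name "EUC-KR".

```r
# Keyword "실업률" written as Unicode escapes:
# 실 = U+C2E4, 업 = U+C5C5, 률 = U+B960.
keyword <- "\uC2E4\uC5C5\uB960"

# Convert to UTF-8 first, then percent-encode.
# URLencode() encodes whatever bytes the string currently holds.
key_utf8 <- enc2utf8(keyword)
url_key <- URLencode(key_utf8, reserved = TRUE)

# For comparison: the same keyword converted to EUC-KR and
# percent-encoded byte by byte (assumes iconv knows "EUC-KR").
euckr_bytes <- charToRaw(iconv(key_utf8, from = "UTF-8", to = "EUC-KR"))
url_key_euckr <- paste0("%", toupper(as.character(euckr_bytes)), collapse = "")

print(url_key)        # UTF-8 form, the one the site accepted
print(url_key_euckr)  # EUC-KR form, the one from the question
```

With a keyword typed directly in a CP949 session, url_key <- URLencode(enc2utf8("실업률")) should behave the same way, since enc2utf8 converts from the native encoding to UTF-8.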