I have a simple web scraper that seems to behave strangely:
- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector
- in the server version of RStudio (running R version 3.4.1 on Linux) the same code misbehaves: the "Kč" suffix is not stripped, so the final conversion to numeric yields NAs
library(rvest)  # provides read_html(), html_node(), html_table() and re-exports %>%

url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")
asdf <- impuls %>%
  html_node("table") %>%      # the pipeline was truncated here; extracting the page's table is a plausible reconstruction
  html_table(header = FALSE)  # header = FALSE produces the default X1..Xn column names used below
Benzin <- asdf$X7             # originally asdf[]$X7; the empty [] is redundant on a data frame
chrBenzin <- gsub("\\sKč","",Benzin) # something is wrong here...
numBenzin <- as.double(chrBenzin)
The whitespace in the values is a hard (non-breaking) space, U+00A0. After running the code, this is the output I got for Benzin (copy/pasted at ideone.com):
By then I was already fairly sure those were hard spaces, but I double-checked here.
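One quick way to verify this in your own session is to look at the code point of the separator character. This is a sketch; the literal string below is a stand-in for one scraped value:

```r
# Confirm the separator is U+00A0 (decimal 160), using a stand-in value
Benzin <- "29.60\u00A0Kč"        # "29.60<NBSP>Kč"
utf8ToInt(substr(Benzin, 6, 6))  # => 160, i.e. 0xA0
```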
When we have hard spaces, there are two alternatives to try.
One is using [[:space:]] in a TRE pattern (TRE is the default regex engine in base R functions).
The other is using a PCRE regex (with perl = TRUE) and the (*UCP) verb at the start, which tells the engine we are dealing with Unicode.
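The TRE alternative can be sketched as follows. The literal string is a stand-in for a scraped value, and note that whether [[:space:]] actually matches U+00A0 under TRE depends on your locale, so test this in your own session:

```r
# TRE (default engine): [[:space:]] may cover the non-breaking space, locale permitting
Benzin <- "29.60\u00A0Kč"
gsub("[[:space:]]Kč", "", Benzin)
```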
In your case, on Linux, the PCRE approach works, so you should stick with the PCRE version (it simply behaves more consistently than TRE):
A quick online test on Linux R:

Benzin <- "29.60 Kč"
gsub("(*UCP)\\s+Kč", "", Benzin, perl = TRUE)
## => "29.60"
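Putting it together, here is a sketch of the full fix applied to a scraped vector; the sample values are stand-ins for what the table actually contains:

```r
# With (*UCP), PCRE's \s also matches the non-breaking space U+00A0
Benzin    <- c("29.60\u00A0Kč", "31.10\u00A0Kč")  # stand-in values
chrBenzin <- gsub("(*UCP)\\s+Kč", "", Benzin, perl = TRUE)
numBenzin <- as.double(chrBenzin)
numBenzin  # => 29.6 31.1
```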