Jindra Lacko - 2 months ago

Strange behaviour of regex in R

I have a simple web scraper that seems to behave strangely:

- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector

- in the server version of RStudio (running R version 3.4.1 on Linux) the gsub() call (and hence the numeric conversion afterwards) fails, and the code produces a vector of NAs.

Do you have any idea what could cause the difference?

library(rvest)
library(stringr)

url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")

# extract all tables from the page
asdf <- impuls %>%
  html_table()

# seventh column of the first table holds the prices, e.g. "29.60 Kč"
Benzin <- asdf[[1]]$X7

chrBenzin <- gsub("\\sKč", "", Benzin) # something is wrong here...

numBenzin <- as.double(chrBenzin)
numBenzin

Answer

The whitespace in the values is a hard space, U+00A0. After I ran the code, I got this output for Benzin (copy/pasted at ideone.com):

(screenshot of the Benzin output)

At that point I was already fairly sure those were hard spaces, but I double-checked here.
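You can also confirm it directly in R by inspecting the code points of one of the scraped values. A minimal sketch, assuming a value like the ones in the screenshot (160 is the decimal code of U+00A0):

x <- "29.60\u00a0Kč"  # hypothetical value containing a non-breaking space
utf8ToInt(x)          # the "space" shows up as 160, i.e. U+00A0
## => [1]  50  57  46  54  48 160  75 269
grepl("\u00a0", x)    # TRUE confirms a hard space is present
## => [1] TRUE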

When we have hard spaces, there are two alternatives to try.

One is using [[:space:]] in a TRE pattern (TRE is the default regex engine in base R functions). The other is using a PCRE regex with the (*UCP) verb at the start to let the regex engine know we are dealing with Unicode.
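A minimal sketch of both alternatives, using a hypothetical value that contains a non-breaking space:

x <- "29.60\u00a0Kč"                    # hypothetical value with a hard space

# TRE alternative; whether [[:space:]] covers U+00A0 can depend on locale/platform
gsub("[[:space:]]Kč", "", x)

# PCRE alternative; (*UCP) makes \s match Unicode whitespace such as U+00A0
gsub("(*UCP)\\sKč", "", x, perl = TRUE)
## => [1] "29.60"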

In your case, on Linux, the PCRE approach works, so you should stick to the PCRE version (it is also more consistent across platforms than TRE):

gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)

A quick online test on Linux R:

Benzin <- "29.60 Kč"
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
## => [1] "29.60"
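Plugging the PCRE version back into the original code, the conversion should then succeed; a sketch reusing Benzin from the question:

chrBenzin <- gsub("(*UCP)\\s+Kč", "", Benzin, perl = TRUE)  # hard space handled via (*UCP)
numBenzin <- as.double(chrBenzin)                           # no more NAs
numBenzin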