I want to search a text for Italian geographic entities based on the geonames database, which I can download with
download.file('http://download.geonames.org/export/dump/IT.zip', destfile = 'IT.zip')
unzip('IT.zip', exdir = 'IT')
library(readr)  # provides read_delim()
it_gn <- read_delim("IT/IT.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
it_gn$X4[it_gn$X1 == 2522713]
#  "Vittoira,Vittoria,vu~ittoria,ヴィットーリア"
-à , -è , -é , -ì , -ò , -ù
-á , -í , -ó , -ú
#  TRUE
#  TRUE
First, the reason your regexps aren't working is that the regexp escape "\xNN" is a Perl extension, so you need to pass "perl=TRUE" if you want to use it:
> grepl('[^\\x00-\\x7F]', 'ヴィットーリア', perl=TRUE)
[1] TRUE
> grepl('[^\\x00-\\x7F]', 'Vittoria', perl=TRUE)
[1] FALSE
(Confusingly, the following will work:
> grepl('[^\x01-\x7F]', 'ヴィットーリア')
[1] TRUE
> grepl('[^\x01-\x7F]', 'Vittoria')
[1] FALSE
because without the double backslash, you're using an R string literal escape sequence "\xNN" instead of the regexp escape sequence above; this embeds the given byte directly in the string regardless of encoding, and it's pretty bad practice, so I'd avoid it here.)
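To make the difference between the two kinds of escape concrete, here is a small sketch (my own illustration, using the letter "A", byte 0x41, as an arbitrary example) that shows what R actually stores in each string before any regexp engine sees it:

```r
# With a double backslash, R stores four characters (backslash, x, 4, 1)
# and leaves the escape for the regexp engine to interpret.
regex_escape <- "\\x41"
nchar(regex_escape)

# With a single backslash, R's parser replaces the escape itself,
# so the string already contains the single character "A".
literal_escape <- "\x41"
nchar(literal_escape)
identical(literal_escape, "A")
```

The first string only means "A" to a regexp engine that understands "\x"; the second is just "A" no matter where it is used.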
That being said, I think the most readable approach is to just include the Unicode characters in your R code:
isinvalid <- grepl('[^[:ascii:]àèéìòùáíóú]', name, perl=TRUE, ignore.case=TRUE)
Here, perl=TRUE allows you to use [:ascii:], which, despite being ugly, seems more readable than the alternatives, and ignore.case=TRUE is necessary if you want capitalized versions of the accented characters to be treated as valid as well.
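As a rough sketch of how this check could be applied to the alternate-names field from the question, assuming a UTF-8 session: the variable names `alt` and `names_vec` are mine, and "città" is an extra name I added for illustration; it does not come from the geonames record.

```r
# Alternate names as shown in the question, plus one hypothetical accented name.
alt <- "Vittoira,Vittoria,vu~ittoria,ヴィットーリア,città"

# Split the comma-separated field into individual names.
names_vec <- strsplit(alt, ",", fixed = TRUE)[[1]]

# TRUE for any name containing a character that is neither ASCII
# nor one of the accented vowels listed in the character class.
isinvalid <- grepl("[^[:ascii:]àèéìòùáíóú]", names_vec,
                   perl = TRUE, ignore.case = TRUE)

# Keep only the names that passed the check.
valid_names <- names_vec[!isinvalid]
valid_names
```

Under those assumptions, only the katakana name should be dropped.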
If your environment is too screwed up to include Unicode in your source code, then you can use normal "\u" escapes to include them:
isinvalid <- grepl('[^[:ascii:]\ue0\ue8\ue9\uec\uf2\uf9\ue1\ued\uf3\ufa]', name, perl=TRUE, ignore.case=TRUE)
Note that you should use "\u" escapes and not "\x" escapes here: "\u" escapes are Unicode code points, whereas "\x" escapes are bytes inserted directly into the string. (Again, rather strangely, you could also use \\x escapes, taking advantage of the Perl extension, because the Perl regexp "\x" escape acts more like R's string literal "\u" escape than like its "\x" escape.)
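If you want to convince yourself of that equivalence, a minimal check along these lines should do. The test vector `x` is made up for illustration, and every non-ASCII character is written as a "\u" escape so that the example does not depend on the encoding of the source file:

```r
# "citt\u00e0" is "città"; the last element is the katakana name from the question.
x <- c("Vittoria", "citt\u00e0", "\u30f4\u30a3\u30c3\u30c8\u30fc\u30ea\u30a2")

# R string literal escape: the pattern already contains the character U+00E0.
via_u <- grepl("[^[:ascii:]\u00e0]", x, perl = TRUE)

# Perl regexp escape: the pattern contains the text \xe0, which PCRE
# interprets as the code point U+00E0 when matching UTF-8 input.
via_x <- grepl("[^[:ascii:]\\xe0]", x, perl = TRUE)

identical(via_u, via_x)
```

Both patterns should flag only the katakana string, which is what you would expect if the Perl "\x" escape is being read as a code point rather than a raw byte.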
Ugh... Anyway, I hope the extra explanation made things more clear rather than less.