CptNemo - 9 months ago
R Question

R: Remove words written in alphabets other than latin (but including diacritics)

I want to search a text for Italian geographic entities based on the geonames database, which I can download and read with:

library(readr)  # provides read_delim()
download.file('http://download.geonames.org/export/dump/IT.zip', destfile = 'IT.zip')
unzip('IT.zip', exdir = 'IT')
it_gn <- read_delim("IT/IT.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)

The database includes alternative versions of each geographic name, including versions in other languages.

For example

it_gn$X4[it_gn$X1 == 2522713]
# [1] "Vittoira,Vittoria,vu~ittoria,ヴィットーリア"

Since the document I'm searching is in Italian, I want to remove all names written in alphabets other than the Latin alphabet, while keeping the diacritics used in Italian:
-à , -è , -é , -ì , -ò , -ù
and also
-á , -í , -ó , -ú
(these are not formally used in Italian, but they might appear). However, it is not clear which regex I should use to identify non-Latin alphabets.

I tried to apply this answer, but the regex doesn't seem to make any difference...

grepl('[^\\x00-\\x7F]', 'ヴィットーリア')
# [1] TRUE

grepl('[^\\x00-\\x7F]', 'Vittoria')
# [1] TRUE

Answer

First, the reason your regexps aren't working is that the regexp escape "\xNN" is a Perl extension, so you need to pass "perl=TRUE" if you want to use it:

> grepl('[^\\x00-\\x7F]', 'ヴィットーリア', perl=TRUE)
[1] TRUE
> grepl('[^\\x00-\\x7F]', 'Vittoria', perl=TRUE)
[1] FALSE

(Confusingly, the following will work:

> grepl('[^\x01-\x7F]', 'ヴィットーリア')
[1] TRUE
> grepl('[^\x01-\x7F]', 'Vittoria')
[1] FALSE

because without the double backslash, you're using an R string literal escape sequence "\xNN" instead of the regexp escape sequence above; this embeds the given byte directly in the string regardless of encoding, and it's pretty bad practice, so I'd avoid it here.)
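To make the difference between the two kinds of escape concrete, here is a small sketch that just counts characters. The variable names are only illustrative.

```r
# Regexp escape: the doubled backslash reaches the regex engine as the
# four characters \ x 7 F, and the engine has to interpret them itself.
regex_escape <- '\\x7F'
nchar(regex_escape)            # 4

# String-literal escape: R's parser replaces \x7F with the single byte 0x7F
# before the regex engine ever sees the pattern.
literal_escape <- '\x7F'
nchar(literal_escape)          # 1
utf8ToInt(literal_escape)      # 127
```

This is presumably also why the range above starts at \x01 rather than \x00: an R character string cannot contain an embedded nul byte, so the literal form of \x00 is rejected by the parser.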

That being said, I think the most readable approach is to just include the Unicode characters in your R code:

isinvalid <- grepl('[^[:ascii:]àèéìòùáíóú]', name,
                   perl=TRUE, ignore.case=TRUE)

The perl=TRUE allows you to use [:ascii:], which, despite being ugly, seems more readable than the alternatives. The ignore.case=TRUE is necessary if you want capitalized versions of the accented characters to be treated as valid as well.

If your environment is too screwed up to include Unicode in your source code, then you can use normal "\u" escapes to include them:

isinvalid <- grepl('[^[:ascii:]\ue0\ue8\ue9\uec\uf2\uf9\ue1\ued\uf3\ufa]', name,
                   perl=TRUE, ignore.case=TRUE)

Note that you should use "\u" escapes and not "\x" escapes here. These are Unicode code points, rather than bytes inserted directly into the string. (Again, rather strangely, you could also use \\x escapes, taking advantage of the Perl extension, because the Perl regexp "\x" escape acts more like R's string literal "\u" escape than like its "\x" escape.)
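As a rough illustration of the code-point behaviour, the sketch below checks one of the accented vowels. The variable names are made up for the example, and the braced form \\x{e0} is just one way of writing the Perl escape.

```r
# A \u escape names a Unicode code point, whatever the encoding of the source file.
a_grave <- '\ue0'              # U+00E0, LATIN SMALL LETTER A WITH GRAVE
utf8ToInt(a_grave)             # 224, i.e. 0xE0

# With perl=TRUE, the regexp escape \\x{e0} is read as the same code point,
# so it should match the character built with \u above.
perl_match <- grepl('\\x{e0}', a_grave, perl = TRUE)
perl_match
```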

Ugh... Anyway, I hope the extra explanation made things more clear rather than less.
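For completeness, here is a minimal sketch of how the pattern could be applied to the comma-separated alternate-names field from the question. The helper keep_latin_names() is hypothetical (it is not part of any package), and the pattern is written with \u escapes so the snippet does not depend on the encoding of the source file.

```r
# Hypothetical helper: keep only the alternate names made of ASCII characters
# plus the accented vowels listed in the question.
keep_latin_names <- function(alt_names) {
  # Same character class as above, written with \u escapes.
  invalid <- '[^[:ascii:]\ue0\ue8\ue9\uec\uf2\uf9\ue1\ued\uf3\ufa]'
  vapply(strsplit(alt_names, ',', fixed = TRUE), function(names) {
    # A name is dropped if it contains at least one disallowed character.
    ok <- !grepl(invalid, names, perl = TRUE, ignore.case = TRUE)
    paste(names[ok], collapse = ',')
  }, character(1))
}

# The example field from the question, with the katakana name written as \u escapes.
field <- 'Vittoira,Vittoria,vu~ittoria,\u30f4\u30a3\u30c3\u30c8\u30fc\u30ea\u30a2'
keep_latin_names(field)
# [1] "Vittoira,Vittoria,vu~ittoria"
```

The function is vectorised over its input, so something like it_gn$X4 <- keep_latin_names(it_gn$X4) should work on the whole column, although missing values would need separate handling (an NA comes back as the string "NA" in this sketch).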