R Question

Regular expression to remove specific multi-byte characters in R

I am trying to remove specific multi-byte characters in R.

Multibyte <- "Sungpil_한성필_韓盛弼_Han"

The linguistic structure of Multibyte is "English_Korean_Chinese_English". I want to remove only the Korean word or only the Chinese word (not both).

The desired result is either:

Sungpil_한성필__Han # Chinese characters were removed.

or

Sungpil__韓盛弼_Han # Korean characters were removed.

Is there a simple way to do this using gsub? I am only aware of a way to keep English-only characters:

gsub("[^A-Za-z_]", "", Multibyte)
[1] "Sungpil___Han"

Answer

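To answer the question itself: yes, a single gsub call is enough if you use a PCRE regex (perl=TRUE) with Unicode script classes. \p{Hangul} matches Korean Hangul characters, and \p{Han} matches Han characters, that is, Chinese characters (including those used as Korean hanja or Japanese kanji):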

> Multibyte <- "Sungpil_한성필_韓盛弼_Han"
> gsub("\\p{Hangul}+", "", Multibyte, perl = TRUE)
[1] "Sungpil__韓盛弼_Han"
> gsub("\\p{Han}+", "", Multibyte, perl = TRUE)
[1] "Sungpil_한성필__Han"
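If you need to do this for more than one script, the two calls can be wrapped in a small helper. This is only a minimal sketch: the function name remove_script is hypothetical (not part of base R), and it assumes the script argument is a script name that PCRE recognizes, such as "Hangul" or "Han".

```r
# Hypothetical helper (not a base R function): remove every run of
# characters that belongs to one Unicode script.
# `script` is assumed to be a PCRE script name, e.g. "Hangul" or "Han".
remove_script <- function(x, script) {
  # Build a pattern such as \p{Hangul}+ from the script name
  pattern <- sprintf("\\p{%s}+", script)
  # perl = TRUE is required for \p{...} property classes
  gsub(pattern, "", x, perl = TRUE)
}

Multibyte <- "Sungpil_한성필_韓盛弼_Han"
no_korean  <- remove_script(Multibyte, "Hangul")  # drops the Hangul run
no_chinese <- remove_script(Multibyte, "Han")     # drops the Han run
```

Because gsub is vectorized over its input, the same helper should also work on a character vector of such names.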

See R online demo.

However, if your input text has a specific, known structure, use the other solution.
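As a rough illustration of what relying on a fixed structure might look like (this is an assumption for illustration, not necessarily the approach of the other solution): if the fields are always separated by underscores and always appear in the order English_Korean_Chinese_English, one could split on the underscore and blank out a field by position. The helper name blank_field is hypothetical.

```r
# Hypothetical helper: blank out one field of a single delimited string.
# Assumes `x` is a length-one character vector and the field order is fixed.
blank_field <- function(x, index, sep = "_") {
  # fixed = TRUE treats the separator as a literal string, not a regex
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]
  parts[index] <- ""              # empty the chosen field, keep its slot
  paste(parts, collapse = sep)    # rejoin, preserving both separators
}

Multibyte <- "Sungpil_한성필_韓盛弼_Han"
without_korean  <- blank_field(Multibyte, 2)  # field 2 is the Korean name
without_chinese <- blank_field(Multibyte, 3)  # field 3 is the Chinese name
```

Unlike the \p{...} approach, this does not inspect the characters at all, so it only gives the right answer when the field positions can be trusted.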

