zhanzezhu zhanzezhu - 4 months ago 36
SQL Question

How to only filter chinese(special character set) in oracle?

data source:

测试demo
demo1
中文2


output:

demo
demo1
2


I try to
select regexp_replace('测试中文demo','[\u0391-\uFFE5]','') from dual
,but no effect. And
\w
including the chinese, so don't use
[^\w]
.

Now I can think of
select regexp_replace('测试中文demo','[^a-zA-Z0-9\s]','') from dual
.

Is there a better way to do it?

Answer

See Searching Unicode characters in Oracle table:

Normally with Regular expression, you can use \x or \u followed by hexadecimal code to search for any character. E.g. \x20 will match space. But REGEXP_LIKE in Oracle does not support \x. You need to use unistr function to convert the code to equivalent character and then use it with REGEXP_LIKE. E.g. REGEXP_LIKE(source,'[' ||unistr('\0020')|| ']');

So, you need something like

select regexp_replace('测试中文demo', '[' || unistr('\0391') || '-' || unistr('\9FA5') || ']','') from dual

Note: a better Chinese regex should include all the following ranges:

---------------------------------------------------------------------------------- 
|Block                                   | ES6 Range   |   ES5 Range              |
|---------------------------------------------------------------------------------|
|CJK Unified Ideographs                  | 4E00-9FFF   | \u4E00-\u9FFF            |
|CJK Unified Ideographs Extension A      | 3400-4DFF   | \u3400-\u4DFF            |
|CJK Unified Ideographs Extension B      | 20000-2A6DF | \uD840\uDC00-\uD869\uDEDF|
|CJK Unified Ideographs Extension C      | 2A700–2B73F | \uD869\uDF00-\uD86D\uDF3F|
|CJK Unified Ideographs Extension D      | 2B740–2B81F | \uD86D\uDF40-\uD86E\uDC1F|
|CJK Unified Ideographs Extension E      | 2B820–2CEAF | \uD86E\uDC20-\uD873\uDEAF|
|CJK Compatibility Ideographs            | F900-FAFF   | \uF900-\uFAFF            |
|CJK Compatibility Ideographs Supplement | 2F800-2FA1F | \uD87E\uDC00-\uD87E\uDE1F|
----------------------------------------------------------------------------------

So, try

select regexp_replace('测试中文demo','[' || unistr('\4E00') || '-' || unistr('\9FFF') || unistr('\3400') || '-' || unistr('\4DFF') || unistr('\D840\DC00') || '-' || unistr('\D869\DEDF') || unistr('\D869\DF00') || '-' || unistr('\D86D\DF3F') || unistr('\D86D\DF40') || '-' || unistr('\D86E\DC1F') || unistr('\D86E\DC20') || '-' || unistr('\D873\DEAF') || unistr('\F900') || '-' || unistr('\FAFF') || unistr('\D87E\DC00') || '-' || unistr('\D87E\DE1F') || ']','') from dual

Comments