haimen haimen - 1 year ago 314
SQL Question

Remove Unicode characters while querying in Hive

I want to clean the unicode the data from the Hive table. The following is the data,

select ('http://10.0.0.1/���m��v������) �a�^�����kn:4�+9x�2c��m�{��')


My required output is to find if there are any unicode characters in my column and to remove it. The output here should be,

http://10.0.0.1/


or completely null. Either of them is fine. If a row contains any unicode character, it is fine to make it null completely.

The following are my tryings,

select REGEXP_REPLACE('http://10.0.0.1/���m��v������) �a�^�����kn:4�+9x�2c��m�{��', '\\[[:xdigit:]]{4}', '')


and

select REGEXP_REPLACE('http://10.0.0.1/���m��v������) �a�^�����kn:4�+9x�2c��m�{��', '[||chr(128)||'-'||chr(255)||]', '')

Executed as Single statement. Failed [40000 : 42000] Error while compiling statement: FAILED: ParseException line 1:193 mismatched input '<EOF>' expecting ) near ')' in function specification
Elapsed time = 00:00:00.220

STATEMENT 1: SELECT Statement failed.


Can somebody help me in cleaning these in my table ?

Thanks

Answer Source

You may use

select REGEXP_REPLACE(YOUR_STRING_HERE, '\\P{ASCII}.*', '')

It will remove all the string up to its end from the first found non-ASCII char.

Hive regex supports Unicode property classes, and \p{ASCII} matches any ASCII char. The opposite Unicode property classes are formed by turning p to upper case. So, \P{ASCII} matches any char that is not ASCII. .* matches any 0+ chars as many as possible, as * is a greedy quantifier.

Note that . does not match line breaks by default. If you need to remove line breaks, add (?s) at the start of the pattern:

'(?s)\\P{ASCII}.*'
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download