Kishan Donga - 1 year ago
Java Question

How to distinguish Unicode characters and ASCII characters

I want to distinguish Unicode characters and ASCII characters from the below string:

abc\u263A\uD83D\uDE0A\uD83D\uDE22123


How can I distinguish characters? Can anyone help me with this issue? I have tried some code, but it crashes in some cases. What is wrong with my code?

The first three characters are abc, and the last three characters are 123. The rest of the string is Unicode characters. I want to make a string array like this:

str[0] = 'a';
str[1] = 'b';
str[2] = 'c';
str[3] = '\u263A\uD83D';
str[4] = '\uDE0A\uD83D';
str[5] = '\uDE22';
str[6] = '1';
str[7] = '2';
str[8] = '3';


Code:

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    for (int i = 0; i < unicodeStr.length(); i++) {
        if (unicodeStr.charAt(i) == '\\') {
            list.add(unicodeStr.substring(i, i + 11));
            i = i + 11;
        } else {
            list.add(String.valueOf(unicodeStr.charAt(i)));
        }
    }
    return list.toArray(new String[list.size()]);
}

Answer Source

ASCII characters exist in Unicode; they are Unicode codepoints U+0000 - U+007F, inclusive.
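As a minimal sketch of that range check (the class name AsciiCheck and the helper isAscii are illustrative, not part of the question's code), a codepoint is ASCII exactly when it is no greater than 0x7F:

```java
class AsciiCheck {
    // Hypothetical helper: true if the codepoint lies in U+0000 - U+007F.
    static boolean isAscii(int codePoint) {
        return codePoint >= 0 && codePoint <= 0x7F;
    }

    public static void main(String[] args) {
        String s = "abc\u263A123";
        // Walk the string by codepoint rather than by char.
        s.codePoints().forEach(cp ->
            System.out.println(String.format("U+%04X", cp)
                + (isAscii(cp) ? " ASCII" : " non-ASCII")));
    }
}
```

Checking codepoints rather than individual chars keeps the test meaningful for characters that occupy two chars.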

Java strings are represented in UTF-16, which encodes Unicode using 16-bit code units. Each Java char is a UTF-16 code unit. Unicode codepoints U+0000 - U+FFFF use 1 UTF-16 code unit and thus fit in a single char, whereas Unicode codepoints U+10000 and higher require a UTF-16 surrogate pair and thus need two chars.
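To make the code unit versus codepoint distinction concrete, here is a small sketch (the class SurrogateDemo and its describe helper are made up for illustration) comparing String.length(), which counts chars, with codePointCount(), which counts codepoints:

```java
class SurrogateDemo {
    // Illustrative helper: reports how many chars (UTF-16 code units)
    // and how many codepoints a string contains.
    static String describe(String s) {
        return "chars=" + s.length()
            + " codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(describe("\u263A"));           // chars=1 codePoints=1
        System.out.println(describe("\uD83D\uDE0A"));     // chars=2 codePoints=1
        System.out.println(Character.charCount(0x263A));  // 1
        System.out.println(Character.charCount(0x1F60A)); // 2
    }
}
```

The pair \uD83D\uDE0A is the surrogate encoding of U+1F60A, which is why it reports two chars but only one codepoint.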

If the string has UTF-16 code units represented as actual char values, then you can use Java's string methods that work with codepoints, e.g.:

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    while (i < unicodeStr.length()) {
        j = unicodeStr.offsetByCodePoints(i, 1);
        list.add(unicodeStr.substring(i, j));
        i = j;
    }
    return list.toArray(new String[list.size()]);
}
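For illustration, the same approach can be run standalone against the string from the question, assuming the escapes are real char values in a Java string literal. The class name CodePointSplit and the ASCII check in main are additions for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplit {
    // Same technique as the method above, made static so it runs on its own.
    static String[] getCharArray(String s) {
        List<String> list = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            // Index just past the codepoint that starts at i.
            int j = s.offsetByCodePoints(i, 1);
            list.add(s.substring(i, j));
            i = j;
        }
        return list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] parts = getCharArray("abc\u263A\uD83D\uDE0A\uD83D\uDE22123");
        for (String p : parts) {
            // Each element holds exactly one codepoint, so testing the first is enough.
            boolean ascii = p.codePointAt(0) <= 0x7F;
            System.out.println(p + (ascii ? " ASCII" : " non-ASCII"));
        }
    }
}
```

This produces nine elements: a, b, c, then U+263A, U+1F60A and U+1F622 as one element each, then 1, 2, 3. Each surrogate pair stays together instead of being split across array entries.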

On the other hand, if the string has UTF-16 code units represented in an encoded "\uXXXX" format (i.e., as 6 distinct characters: '\', 'u', and four hex digits), then things get a little more complicated, as you have to parse the encoded sequences manually.

If you want to preserve the "\uXXXX" strings in your array, you could do something like this:

private boolean isUnicodeEncoded(String s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j, start;
    char ch;
    while (i < unicodeStr.length()) {
        start = i;
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch)) {
                i += j;
            }
        }
        list.add(unicodeStr.substring(start, i));
    }
    return list.toArray(new String[list.size()]);
}

If you want to decode the "\uXXXX" strings into actual chars in your array, you could do something like this instead:

private boolean isUnicodeEncoded(String s, int index)
{
    return (
        (s.charAt(index) == '\\') &&
        ((index+5) < s.length()) &&
        (s.charAt(index+1) == 'u')
    );
}

private String[] getCharArray(String unicodeStr) {
    ArrayList<String> list = new ArrayList<>();
    int i = 0, j;
    char ch1, ch2;
    while (i < unicodeStr.length()) {
        if (isUnicodeEncoded(unicodeStr, i)) {
            ch1 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
            j = 6;
        }
        else {
            ch1 = unicodeStr.charAt(i);
            j = 1;
        }
        i += j;
        if (Character.isHighSurrogate(ch1) && (i < unicodeStr.length())) {
            if (isUnicodeEncoded(unicodeStr, i)) {
                ch2 = (char) Integer.parseInt(unicodeStr.substring(i+2, i+6), 16);
                j = 6;
            }
            else {
                ch2 = unicodeStr.charAt(i);
                j = 1;
            }
            if (Character.isLowSurrogate(ch2)) {
                list.add(String.valueOf(new char[]{ch1, ch2}));
                i += j;
                continue;
            }
        }
        list.add(String.valueOf(ch1));
    }
    return list.toArray(new String[list.size()]);
}

Or, something like this (per https://stackoverflow.com/a/24046962/65863):

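// Requires java.io.IOException, java.io.StringReader, java.util.ArrayList, and java.util.Properties.
// Properties.load() declares IOException, so the method must declare or handle it.
private String[] getCharArray(String unicodeStr) throws IOException {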
    Properties p = new Properties();
    p.load(new StringReader("key="+unicodeStr));
    unicodeStr = p.getProperty("key");
    ArrayList<String> list = new ArrayList<>();
    int i = 0;
    while (i < unicodeStr.length()) {
        if (Character.isHighSurrogate(unicodeStr.charAt(i)) &&
            ((i+1) < unicodeStr.length()) &&
            Character.isLowSurrogate(unicodeStr.charAt(i+1)))
        {
            list.add(unicodeStr.substring(i, i+2));
            i += 2;
        }
        else {
            list.add(unicodeStr.substring(i, i+1));
            ++i;
        }
    }
    return list.toArray(new String[list.size()]);
}