mumpitz mumpitz - 3 years ago 122
Java Question

Cannot identify surrogate characters in Java string

I am having trouble identifying surrogate characters in strings like devā́n. I read the relevant questions concerning the topic here on SO, but something is still wrong with this...

As you can see, the "natural" length (I just made up that expression) of this string is 5, but "devā́n".length() gives me 6.

That is fine, because ā́ consists of two characters internally (it's not within the UTF-16 code range). But I would like to get the length of the string as you'd read it or as it's printed, so 5 in this case.

I tried identifying the weirdo chars with the following tricks found here and here, but they don't work and I'm always getting 6. Just have a look at this:

//string containing surrogate pair
String s = "devā́n";

//prints the string properly
System.out.println("String: " + s);

//prints "Length: 6"
System.out.println("Length: " + s.length());

//prints "Codepoints: 6"
System.out.println("Codepoints: " + s.codePointCount(0, s.length()));

//false
System.out.println(Character.isSurrogate(s.charAt(3)));

//false
System.out.println(Character.isSurrogate(s.charAt(4)));

//six code points
System.out.println("\n");
for (int i = 0; i < s.length(); i++) {
    System.out.println(s.charAt(i) + ": " + s.codePointAt(i));
}


Is it maybe possible that ā́ is not a valid pair of surrogate chars? How can I identify such a compound char and count it as only one?

BTW, the output of the above code is:

String: devā́n
Length: 6
Codepoints: 6
false
false


d: 100
e: 101
v: 118
ā: 257
́: 769
n: 110

Answer Source

First of all, the reason that 769 (U+0301) does not test as a surrogate character is that it is NOT a surrogate character. Surrogates are used when a Unicode code point outside of plane 0 is represented in UTF-16. (Surrogates are code units in the range U+D800 through U+DFFF.)
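
To make the distinction concrete, here is a minimal sketch that contrasts the string from the question with a string that really does contain a surrogate pair. The class name SurrogateDemo, the helper countSurrogates, and the choice of U+1F600 as the example character are made up for illustration; only the java.lang.Character and String calls are standard API.

```java
class SurrogateDemo {
    // Counts the UTF-16 code units in s that lie in the surrogate range U+D800..U+DFFF.
    static int countSurrogates(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The string from the question: U+0101 followed by the combining mark U+0301.
        String combining = "dev\u0101\u0301n";
        // A string with one code point outside plane 0 (U+1F600), stored as a surrogate pair.
        String astral = "a\uD83D\uDE00b";

        System.out.println(countSurrogates(combining));                 // 0
        System.out.println(countSurrogates(astral));                    // 2
        System.out.println(astral.length());                            // 4
        System.out.println(astral.codePointCount(0, astral.length()));  // 3
    }
}
```

In the second string, codePointCount already collapses the surrogate pair into one code point. In the question's string there is nothing to collapse, because both 257 and 769 are ordinary plane 0 code points.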

So what you are really trying to do here is to figure out how many "ordinary" characters there are in a UTF-16 string. This is done in two steps:

  • First, normalize the string to NFC form (see Normalizing Text) using the Normalizer API.
  • Then use the String API to find the number of code points in the string; e.g. use String.codePointCount (javadoc).

In this case, this still fails. The reason is that the code point sequence

ā: 257
́: 769

actually represents an "a" character with two diacritical marks (a macron and an acute accent). This cannot be represented as a single Unicode code point, so the NFC form is two code points.
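
The two-step recipe, and the way it falls short for this particular string, can be checked with a short sketch. The class name NfcCount and the helper nfcCodePoints are hypothetical; java.text.Normalizer and String.codePointCount are the standard APIs mentioned above.

```java
import java.text.Normalizer;

class NfcCount {
    // Normalizes s to NFC, then counts Unicode code points (not UTF-16 code units).
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // "e" + combining acute composes to the single code point U+00E9.
        System.out.println(nfcCodePoints("e\u0301"));            // 1
        // U+0101 + U+0301 has no precomposed form, so NFC keeps both code points.
        System.out.println(nfcCodePoints("dev\u0101\u0301n"));   // 6
    }
}
```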

What confuses this even further is that a typical renderer will display the "acute" accent over the following "n" character.

It is going to be very difficult to deal with pathological examples like this where base characters have multiple diacriticals that might render strangely. Maybe you need to translate to NFD and then count the code points that are not diacriticals.
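
A rough sketch of that last idea follows: decompose to NFD, then count only the code points whose general category is not a mark. The class name VisibleLength and the method countBaseCharacters are invented for this example, and treating the three mark categories as "diacriticals" is an assumption. It is a heuristic, not full grapheme cluster segmentation, so other kinds of multi-code-point sequences may still be miscounted.

```java
import java.text.Normalizer;

class VisibleLength {
    // Counts code points in the NFD form of s that are not combining marks.
    static int countBaseCharacters(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        int count = 0;
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            int type = Character.getType(cp);
            // Assumption: every mark category (Mn, Me, Mc) counts as a diacritical.
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.ENCLOSING_MARK
                    || type == Character.COMBINING_SPACING_MARK;
            if (!isMark) {
                count++;
            }
            // Advance by one or two code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // NFD gives d, e, v, a, U+0304, U+0301, n: seven code points, five of them non-marks.
        System.out.println(countBaseCharacters("dev\u0101\u0301n")); // 5
    }
}
```

Stepping by Character.charCount(cp) rather than by one keeps the loop correct for strings that also contain real surrogate pairs.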
