tomas - 4 months ago
Java Question

Is there a Char collator in Java?

I'm working on a small app that counts how often each character appears in a text and prints a simple report. It is based on a TreeMap and is supposed to work with any language that can be encoded in UTF-8 (so far). When I try to use the standard collator by calling

Collator.getInstance()

I get the exception

java.lang.ClassCastException: java.lang.Character cannot be cast to java.lang.String

Is there any Char collator?

static Map<Character, Integer> map = new TreeMap<>();


The TreeMap constructor can take a Collator, but a Collator does not work with Character keys.

public static void main(String[] args) {
    InputStream in = System.in;

    try {
        if (in.available() == 0) System.exit(0);
    } catch (IOException e) {
        e.printStackTrace();
    }

    count(in);
    printMap();
}


static void count(InputStream in) {
    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
            .lines()
            .forEach(x -> tallyCharArray(x.toCharArray()));
}

static void tallyCharArray(char[] chars) {
    for (int i = 0; i < chars.length; i++) {
        map.merge(chars[i], 1, Integer::sum);
    }
}

static void printMap() {
    map.entrySet().stream()
            .forEach(x -> System.out.println(x.getKey() + "\t" + x.getValue()));
}


PROBLEM with compare

static Map<Character, Integer> map = new TreeMap<>(
Collator.getInstance().compare(String.valueOf(c1), String.valueOf(c2))
);


This is clumsy, and it doesn't work yet. How do I bind c1 and c2 to the map?

Answer

UPDATED

If you only need the Collator to sort the result when printing it, just sort after counting, which performs much better. See the code further down.

If you want the TreeMap to use a Collator, get the Collator, then give a Comparator<Character> to the TreeMap constructor. Since you're using Java 8 streams, you might as well do this using a lambda expression:

Collator collator = Collator.getInstance(Locale.GERMAN);
collator.setStrength(Collator.PRIMARY);
Map<Character, int[]> countMap = new TreeMap<>(
        (c1, c2) -> collator.compare(c1.toString(), c2.toString())
);

Because that Collator has PRIMARY strength, characters that differ only by accent or by upper/lower case compare as equal and are merged into a single key. See the sample output at the end of this answer.
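To see the merging in isolation, here is a minimal sketch. The class name PrimaryCollatorDemo and the count helper are made up for illustration, and it uses Integer values with merge() as in the question rather than int[] counters. Since TreeMap keeps whichever key was inserted first, the spelling that ends up in the result depends on the order of the input.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of the original answer.
public class PrimaryCollatorDemo {

    // Counts chars in a TreeMap whose key order comes from a
    // PRIMARY-strength German Collator.
    static Map<Character, Integer> count(String text) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        collator.setStrength(Collator.PRIMARY);

        // Collator only compares Strings, so wrap each Character.
        Comparator<Character> byCollator =
                (c1, c2) -> collator.compare(c1.toString(), c2.toString());

        Map<Character, Integer> counts = new TreeMap<>(byCollator);
        for (char ch : text.toCharArray()) {
            // Keys that compare as equal share one entry; the key
            // that was inserted first is the one that is kept.
            counts.merge(ch, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 'd' and 'D' differ only by case, so they land in one entry.
        System.out.println(count("dDd"));
    }
}
```

Looking up either spelling finds the same entry, because the map consults the comparator rather than equals().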

Full code for sorting after counting

String input = "Das Polaritätsprofil für das Wort \"Hund\" als Testeinheit " +
               "könnte zeigen , dass verschiedene Personen unterschiedliche " +
               "Einstellungen zu diesen Tieren haben .";

Map<Character, int[]> countMap = new HashMap<>();
for (Character ch : input.toCharArray()) {
    int[] counter = countMap.get(ch);
    if (counter == null)
        countMap.put(ch, new int[] { 1 });
    else
        counter[0]++;
}
@SuppressWarnings("unchecked")
Entry<Character, int[]>[] counts = countMap.entrySet().toArray(new Map.Entry[countMap.size()]);
Collator collator = Collator.getInstance(Locale.GERMAN);
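// Compare by key only: Entry.toString() would also compare the int[] value's "[I@<hash>" text.
Arrays.sort(counts, (e1, e2) -> collator.compare(e1.getKey().toString(), e2.getKey().toString()));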
for (Entry<Character, int[]> entry : counts)
    System.out.printf("%c - %d%n", entry.getKey(), entry.getValue()[0]);

Output from sorting after counting

, - 1
. - 1
" - 2
  - 20
a - 6
ä - 1
b - 1
c - 3
d - 6
D - 1
E - 1
e - 22
f - 2
g - 2
h - 5
H - 1
i - 11
k - 1
l - 6
n - 15
ö - 1
o - 4
P - 2
p - 1
r - 8
s - 12
t - 8
T - 2
u - 4
ü - 1
v - 1
W - 1
z - 2

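As you can see, the result is printed according to German collation, with ä between a and b. The mixed case order in this listing (d before D, but E before e) is what you get when whole entries are compared, since their text includes the int[] value's identity hash; comparing only the keys avoids that.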
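For the Map<Character, Integer> and merge() from the question, sorting after counting could look roughly like this. It is only a sketch: the class and method names are invented here, and it sorts a list of keys instead of an entry array.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
public class SortAfterCounting {

    // Tally every char in an unordered HashMap; no collation work here.
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char ch : text.toCharArray()) {
            counts.merge(ch, 1, Integer::sum);
        }
        return counts;
    }

    // Return the counted chars in the order defined by the Collator.
    static List<Character> sortedKeys(Map<Character, Integer> counts,
                                      Collator collator) {
        List<Character> keys = new ArrayList<>(counts.keySet());
        keys.sort((c1, c2) -> collator.compare(c1.toString(), c2.toString()));
        return keys;
    }

    public static void main(String[] args) {
        // U+00E4 is a-umlaut, written as an escape to keep the source ASCII.
        Map<Character, Integer> counts = count("z\u00e4ab\u00e4");
        Collator collator = Collator.getInstance(Locale.GERMAN);
        for (Character key : sortedKeys(counts, collator)) {
            System.out.printf("%c - %d%n", key, counts.get(key));
        }
    }
}
```

The Collator is only consulted once per comparison during the final sort, not on every map lookup while counting.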

If you want upper- and lower-case characters unified, you should decide which you want in the result and convert to that, otherwise it'll be arbitrary.
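One way to do that conversion, sketched under the assumption that lower case is the form you want to keep (the class and method names are hypothetical), is to fold each char before it goes into the map:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
public class CaseFoldedCount {

    // Counts chars after converting each one to lower case, so that
    // 'D' and 'd' are tallied under the single key 'd'.
    static Map<Character, Integer> countLowerCased(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char ch : text.toCharArray()) {
            // Character.toLowerCase(char) does not depend on the default
            // locale; accented letters remain separate keys.
            counts.merge(Character.toLowerCase(ch), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countLowerCased("Das das"));
    }
}
```

The resulting map can then be sorted with a Collator for printing, and the report always shows the lower-case form.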

Output from using PRIMARY Collator in TreeMap

  - 20
, - 1
. - 1
" - 2
a - 7
b - 1
c - 3
D - 7
e - 23
f - 2
g - 2
H - 6
i - 11
k - 1
l - 6
n - 15
o - 5
P - 3
r - 8
s - 12
t - 10
ü - 5
v - 1
W - 1
z - 2

As you can see, sometimes you get a lowercase letter (e.g. a), sometimes you get an uppercase letter (e.g. D), and sometimes you get an accented letter (e.g. ü). That just seems wrong to me.