ahmet alp balkan ahmet alp balkan - 4 months ago 21x
Java Question

Converting Symbols, Accent Letters to English Alphabet

The problem is that, as you know, there are thousands of characters in the Unicode chart and I want to convert all the similar characters to the letters which are in English alphabet.

For instance here are a few conversions:

tђє Ŧค๓เℓy --> the Family

and I saw that there are more than 20 versions of letter A/a. and I don't know how to classify them. They look like needles in the haystack.

The complete list of unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and see the variations of letters.

How can I convert all these with Java? Please help me :(


Reposting my post from How do I remove diacritics (accents) from a string in .NET?

This method works fine in java (purely for the purpose of removing diacritical marks aka accents).

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");