tschens tschens - 5 months ago 42
Java Question

Encoding of umlaute in Jsoup with strange behaviour

I have some problems with the encoding behaviour of JSoup library.

I want to parse the content of a webpage, and to do so I have to insert people's names, which may contain German umlauts such as ä, ö, etc.

This is the code I am using:

doc = Jsoup.parse(new URL(searchURL).openStream(), "UTF-8", searchURL);


to parse the HTML of the respective webpage.

But when I look at the document, the ä is displayed as follows:

Käse

What am I doing wrong with the encoding?

The webpage has the following header:

<!doctype html>
<html>
<head lang="en">
<title>Käse - Semantic Scholar</title>
<meta charset="utf-8">
</html>


Someone help? Thanks :)

EDIT: I tried Stephan's answer and it worked for the webpage www.semanticscholar.org, but I am also parsing another webpage,
http://www.authormapper.com/

The same code does not work for this webpage if the name of an author contains a German umlaut.
Does anyone know why this is not working? It's very embarrassing not to know this.

Answer

This is a known issue with Jsoup. Here are two options for loading the content yourself and handing it to Jsoup as a string:

Option 1: JDK only

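import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

InputStream is = null;

try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200

    // Decode the raw bytes explicitly as UTF-8 and collect the characters
    // in a java.io.StringWriter, so no extra dependency is needed
    int n;
    char[] buffer = new char[4096];
    Reader r = new InputStreamReader(is, "UTF-8");
    Writer w = new StringWriter();
    while (-1 != (n = r.read(buffer))) {
        w.write(buffer, 0, n);
    }

    // Parse html (searchURL also serves as base URI for relative links)
    String html = w.toString();
    Document doc = Jsoup.parse(html, searchURL);
} catch (IOException e) {
    // Handle exception ...
} finally {
    try {
        if (is != null) {
            is.close();
        }
    } catch (final IOException ioe) {
        // ignore
    }
}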
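
To illustrate why the charset passed to InputStreamReader matters, here is a minimal sketch (the class and method names are made up for this example). It encodes a string with one charset and decodes the resulting bytes with another. Decoding UTF-8 bytes as ISO-8859-1 turns each umlaut into two wrong characters, which is the typical symptom of a charset mismatch. The strings use Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {

    // Encodes text to bytes with one charset, then decodes those bytes with another.
    static String reinterpret(String text, Charset encodedAs, Charset decodedAs) {
        byte[] raw = text.getBytes(encodedAs);
        return new String(raw, decodedAs);
    }

    public static void main(String[] args) {
        String original = "K\u00e4se"; // "Kaese" written with a-umlaut

        // Matching charsets: the text survives the round trip unchanged.
        String ok = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Mismatch: the two UTF-8 bytes of the umlaut (0xC3 0xA4) are read as
        // two separate ISO-8859-1 characters, U+00C3 and U+00A4.
        String broken = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(original));             // true
        System.out.println(broken.equals("K\u00c3\u00a4se")); // true
        System.out.println(broken.length());                  // 5 instead of 4
    }
}
```

If the parsed document shows two odd characters in place of each umlaut, the bytes were most likely UTF-8 but were decoded with a single-byte charset somewhere along the way.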

Option 2: With Commons IO

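import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

InputStream is = null;

try {
    // Connect to website
    URL tmp = new URL(searchURL);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup, decoding the raw bytes explicitly as UTF-8
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    String html = IOUtils.toString(is, "UTF-8");

    // Parse html (searchURL also serves as base URI for relative links)
    Document doc = Jsoup.parse(html, searchURL);
} catch (IOException e) {
    // Handle exception ...
} finally {
    IOUtils.closeQuietly(is);
}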
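
Both options hard-code "UTF-8". If a server sends its pages in a different charset, that fixed value would produce broken umlauts again. Whether this is what happens with the second site from the question is not confirmed here; it is only one possible cause. As a hedged sketch, the helper below (a hypothetical class, not part of Jsoup or Commons IO) reads the charset from a Content-Type header value and falls back to a default when none is declared or the name is unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback if the header
    // is missing, has no charset parameter, or names an unsupported charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

In either option above, the result could replace the fixed charset, for example `new InputStreamReader(is, ContentTypeCharset.fromContentType(connection.getContentType(), StandardCharsets.UTF_8))`. Note that a page can also declare its charset only in a meta tag, which this sketch does not inspect.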