user3249186 - 3 months ago
Java Question

Jsoup - Parsing a HTML file with charset iso-8859-1

I am having trouble with special characters and charset = iso-8859-1.
The same code that I use here works fine with UTF-8, so I do not understand what I am doing wrong.

Here is the code:

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());


Here is the output:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jo&atilde;ozinho Pereira</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulh&otilde;es</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">In&aacute;cio Loiola</span>


Here is how it should be:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span>
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span>


How can I fix it?

Answer

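Setting the escape mode to EscapeMode.xhtml makes jsoup escape only the basic XML characters (lt, gt, amp and quot) instead of writing named HTML entities such as &atilde;, so the accented characters come out as-is. Try this code: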

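  // EscapeMode is the nested enum org.jsoup.nodes.Entities.EscapeMode
  File input = new File("/users/marcioapf/example.html");
  Document doc = Jsoup.parse(input, "iso-8859-1", "");
  // write accented characters literally rather than as named entities
  doc.outputSettings().escapeMode(EscapeMode.xhtml);
  Elements elements = doc.select("span.DEPUTADO");
  System.out.println(elements.toString());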
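
One more thing that may be worth checking: as far as I know, jsoup also uses the document's output charset to decide which characters it can keep intact and which it has to escape. The sketch below does not use jsoup at all. It only uses the JDK's CharsetEncoder to show that iso-8859-1 can represent the names from the question while US-ASCII cannot. The class name EncodableCheck and the helper canWrite are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {

    // Hypothetical helper: true if every character of text exists in the charset.
    public static boolean canWrite(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding:
        // \u00e3 = a with tilde, \u00f5 = o with tilde, \u00e1 = a with acute
        String[] names = {
            "Jo\u00e3ozinho Pereira",
            "Isnaldo Bulh\u00f5es",
            "In\u00e1cio Loiola"
        };
        for (String name : names) {
            System.out.println(name
                + " | ISO-8859-1: " + canWrite(name, StandardCharsets.ISO_8859_1)
                + " | US-ASCII: " + canWrite(name, StandardCharsets.US_ASCII));
        }
    }
}
```

For all three names the ISO-8859-1 check returns true and the US-ASCII check returns false. So if the output charset were ever switched to something like US-ASCII, I would expect escaped characters to show up again even with EscapeMode.xhtml.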