Kennedy Kan - 4 months ago
HTML Question

Java: parse special characters

I have a program that collects some HTML data.

import java.io.FileWriter;
import java.util.Iterator;

// Assumption: escapeHtml comes from Apache Commons Lang 2.x
import org.apache.commons.lang.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Uni_Extract {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        String csvFile = "C://Users/Kennedy/Desktop/university.csv";
        FileWriter writer = new FileWriter(csvFile);

        for (int i=2; i<=2; i++){
            String url = "http://www.4icu.org/reviews/index"+i+".htm";
            Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

            Elements cells = doc.select("td.i");

            Iterator<Element> iterator = cells.iterator();
            while (iterator.hasNext()) {
                Element cell = iterator.next();

                String university = Jsoup.parse((cell.select("a").text())).text();
                university = StringEscapeUtils.escapeHtml(university);
                String country = cell.nextElementSibling().select("img").attr("alt");
                System.out.printf("country : %s, university : %s %n", country, university);
            }
        }
        writer.flush();
        writer.close();
    }
}


However, when my program comes across some special characters, it returns the original HTML code. How should I parse them?

For example, it returns Azerbaycan D&ouml;vlet Pedaqoji Universiteti, with the special character "ö" shown as the entity &ouml;. How can I solve this and other similar cases?

Answer

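After simplifying your code a little and removing the call to escapeHtml, everything seems to work correctly. jsoup's text() already returns decoded text, and escapeHtml turns characters such as "ö" back into HTML entities (&ouml;), which is why you were seeing HTML code in the output. Here's my code and the relevant line of output: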

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

import java.io.*;
import java.util.*;

public class Test
{
    public static void main(String[] args) throws IOException {
        System.out.println("Started");

        String url = "http://www.4icu.org/reviews/index2.htm";
        Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();  
        while (iterator.hasNext()) {
            Element cell = iterator.next();

            String university = Jsoup.parse((cell.select("a").text())).text();
            String country = cell.nextElementSibling().select("img").attr("alt");
            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

Output:

...
country : Azerbaijan, university : Azerbaycan Dövlet Aqrar Universiteti
...
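
One related point, in case the decoded names end up in the CSV file that the question's code opens but never writes to: FileWriter(String) uses the platform default charset, which may not be UTF-8 on older JDKs or some Windows setups, so "ö" could be mangled on the way to disk. The following is only a minimal sketch of writing the rows as UTF-8 with the standard library. The class name Utf8CsvSketch, its helper methods, and the two-column row layout are assumptions made for illustration, not part of the original code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original question or answer.
final class Utf8CsvSketch {

    // Wrap one CSV field in double quotes, doubling any embedded double quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Build one "country,university" row (assumed two-column layout).
    static String toRow(String country, String university) {
        return quote(country) + "," + quote(university);
    }

    // Write each {country, university} pair as one line, encoded as UTF-8.
    static void writeRows(Path file, String[][] rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                writer.write(toRow(row[0], row[1]));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real CSV path.
        Path file = Files.createTempFile("university", ".csv");
        // \u00f6 is "ö"; the escape keeps this source file independent of its own encoding.
        writeRows(file, new String[][] {
            {"Azerbaijan", "Azerbaycan D\u00f6vlet Aqrar Universiteti"}
        });
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

In the scraping loop, each (country, university) pair would be collected and passed to writeRows instead of only being printed. Whether the console itself shows "ö" correctly is a separate matter that depends on the terminal's encoding.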