EricHo - 1 month ago
Java Question

How to grab Chinese characters from HTML using a Java InputStream?

I would like to download some data from a website using the code below.

It has no problem downloading English text and numbers, but it does not produce the correct Chinese characters when I try to grab Chinese content.

String url = "https://hk.finance.yahoo.com/q/ct?s=1928.HK";
URL yahooUrl = new URL(url);
// No charset is passed here, so the platform default is used for decoding.
reader = new BufferedReader(new InputStreamReader(yahooUrl.openStream()));
String line = "";
while ((line = reader.readLine()) != null) {
    htmlData.append(line);
}
// Capture the text between <div class="title"><h2> and </h2>.
Pattern p = Pattern.compile(
        Pattern.quote("<div class=\"title\"><h2>") + "(.*?)"
        + Pattern.quote("</h2>"));
Matcher match = p.matcher(htmlData.toString());
if (match.find()) {
    stockName = match.group(1);
}


Does anyone know how to grab content in other languages from the internet using a Java InputStream?

Answer

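You didn't specify a character encoding for the InputStreamReader, so it falls back to the platform's default charset, which may not match the encoding the page is sent in. Assuming the page is served as UTF-8, pass that charset explicitly to read the Chinese characters correctly: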

reader = new BufferedReader(new InputStreamReader(yahooUrl.openStream(), "UTF-8"));
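
To see the effect without hitting the network, here is a minimal sketch that decodes the same UTF-8 bytes twice, once with the right charset and once with a wrong one. It assumes the page is UTF-8. The class name CharsetDemo, the helpers readAll and extractTitle, and the sample HTML fragment are all made up for illustration; the fragment is not the real content of the Yahoo page.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetDemo {

    // Reads the whole stream as text, decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Returns the text inside <div class="title"><h2>...</h2>, or null if there is no match.
    static String extractTitle(String html) {
        Pattern p = Pattern.compile(
                Pattern.quote("<div class=\"title\"><h2>") + "(.*?)" + Pattern.quote("</h2>"));
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; \u4e2d\u6587 is a two-character Chinese sample string.
        byte[] body = "<div class=\"title\"><h2>\u4e2d\u6587</h2></div>"
                .getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in: characters survive.
        String good = extractTitle(readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8));
        // Decoded with a mismatched charset: each byte becomes its own character.
        String bad = extractTitle(readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1));

        System.out.println("UTF-8 length:      " + good.length());
        System.out.println("ISO-8859-1 length: " + bad.length());
    }
}
```

With your code, the equivalent change is passing StandardCharsets.UTF_8 (or the "UTF-8" string as above) as the second argument to InputStreamReader. If a site uses a different encoding, such as Big5 on some older Chinese pages, you would pass that charset instead.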