CanCeylan - 3 months ago

Parse Web Site HTML with Java

I want to parse a simple web site and scrape information from that web site.

I have parsed XML files with DocumentBuilderFactory before, so I tried to do the same thing for an HTML file, but it always gets stuck in an infinite loop.

URL url = new URL("");
URLConnection uc = url.openConnection();

InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;

FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);

while ((inputLine = in.readLine()) != null) {
    out.println(inputLine);
}
in.close();
out.close();

File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);

NodeList prelist = doc.getElementsByTagName("body");

What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?


There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:

Document doc = Jsoup.connect("").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

Elements body = doc.select("body");

Or if you want all links:

Elements links = doc.select("body a");
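
Once you have the elements, you will usually want to read something out of them. Here is a rough sketch of that step, assuming the jsoup jar is on the classpath. The class name LinkLister, the inline HTML, and the example.com base URL are made up for illustration; the page is parsed from a string so the sketch runs without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkLister {

    // Returns "link text -> absolute URL" for every <a> inside <body>.
    // baseUri is only used to resolve relative href values.
    public static List<String> listLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("body a");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            result.add(link.text() + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical page content, used instead of a live site.
        String html = "<html><body><p><a href=\"/news\">News</a> and "
                + "<a href=\"https://example.org/about\">About</a></p></body></html>";
        for (String line : listLinks(html, "https://example.com/")) {
            System.out.println(line);
        }
    }
}
```

For a live page you would swap Jsoup.parse(html, baseUri) for Jsoup.connect(url).get(); the select and loop parts stay the same.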

You no longer need to open connections or handle streams yourself. If you have ever used jQuery, the selector syntax will look very familiar.
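
As for the DocumentBuilder approach: I cannot say from the snippet what causes the infinite loop, but one general reason an XML parser is a poor fit here is that real-world HTML is usually not well-formed XML (unclosed tags such as <br> are normal in HTML). The sketch below only illustrates that point with two made-up strings; the class name XmlVsHtml and the helper parsesAsXml are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlVsHtml {

    // Returns true if the markup is accepted by the XML DOM parser,
    // false if the parser rejects it as not well-formed.
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string; surface it loudly.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> is never closed, which XML does not allow.
        String html = "<html><body><p>Hello<br>world</p></body></html>";
        // The same content written as well-formed XML.
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>";

        System.out.println("HTML parses as XML:  " + parsesAsXml(html));
        System.out.println("XHTML parses as XML: " + parsesAsXml(xhtml));
    }
}
```

The parser's default error handler may also print a message to stderr for the first string. JSoup is built to tolerate this kind of markup, which is why it is the more comfortable tool for scraping.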