Iñigo Allende - 6 months ago
Java Question

Scraping data from a webpage. Java, HtmlUnit

I'm trying to scrape some information from a webpage. My problem is that the response I get doesn't contain what I'm looking for.

If I view the page source, I find an empty section:

<section id="player-controller">
</section>


But if I inspect the elements I want data from in the browser, they appear inside that section.

Since the content is generated dynamically, I tried using HtmlUnit, but I still can't get it. Maybe I'm looking at this the wrong way.

Is there any way I can get the generated markup with HtmlUnit, or should I use a different tool?

Solved

By using HtmlUnit and pausing for a moment before printing the page, I got it to print the missing content:

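// Package names as in HtmlUnit 2.x; HtmlUnit 3.x uses org.htmlunit instead.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webclient = new WebClient();
HtmlPage currentPage = webclient.getPage("https://www.dubtrack.fm/join/chilloutroom");
Thread.sleep(2000); // give the page's JavaScript time to fill in the dynamic content
System.out.println(currentPage.asXml());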
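A fixed two-second sleep works here, but it either wastes time or is too short when the page is slow. As a rough sketch (not part of the original solution), the wait could instead poll a condition until it holds or a timeout expires. The class name `PollingWait` and the method `waitUntil` are hypothetical helpers made up for this illustration, and the demo condition in `main` only simulates content that shows up after about 300 ms.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a condition instead of sleeping for a fixed time.
final class PollingWait {

    // Returns true as soon as the condition holds, or false once the timeout
    // has elapsed (or the thread was interrupted while waiting).
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Simulated condition: "content" becomes available after roughly 300 ms.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("content ready: " + ready);
    }
}
```

With HtmlUnit, the condition would presumably be a check that the `player-controller` section of `currentPage` is no longer empty. HtmlUnit's `WebClient` also provides `waitForBackgroundJavaScript(long timeoutMillis)`, which may be enough on its own.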

Answer

If you examine the text of the page as it is first loaded, the dynamic content won't be there yet. The JavaScript in callScraper.html opens another page and then waits two seconds before reading the contents of the HTML element. Timing can be tricky here: if the delay is too short, the element will still be empty when it is read. I hope the following code is helpful.

callScraper.html

<!DOCTYPE html>
<html>
<head>
<title>Call test for scraping</title>
<meta charset="UTF-8" />
<script>
var newWindow;
var contents;
function timed() {
contents.value = contents.value + "\r\n" +"function timed started" + "\r\n";
contents.value = contents.value + "\r\n" + newWindow.document.getElementById("player-controller").innerHTML;
}
function starter() {
// alert("Running starter");
contents = document.getElementById("contents");
newWindow = window.open("scraper.html");
contents.value = contents.value + "\r\nTimer started\r\n";
setTimeout(timed, 2000);
}
window.onload=starter;
</script>
</head>
<body>
<p>This will open another page and then display an element from that page.</p>
<form name="reveal">
<textarea id="contents" cols="50" rows="50"></textarea>
</form>
</body>
</html>

scraper.html

<!DOCTYPE html>
<html>
<head>
<title>Test for scraping</title>
<meta charset="UTF-8" />
<script>
var section;
function starter() {
section = document.getElementById("player-controller");
// alert(":"+section.innerHTML+";");
section.innerHTML = "<p>inner text</p>";
// alert(":" +section.innerHTML + ":");
}
window.onload = starter;
</script>
</head>
<body>
<p>See http://stackoverflow.com/questions/37513393/scrapping-data-from-webpage-java-htmlunit</p>
<section id="player-controller">

</section>
</body>
</html>
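On the Java side, the same idea of reading the section after it has been filled in can be sketched with the JDK's own XML parser. The example assumes that the markup is available as a well-formed XML string, for instance what `asXml()` printed in the question (real pages may not always parse cleanly this way). The class `SectionExtractor` and the method `sectionText` are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts the text of one element from well-formed XML.
final class SectionExtractor {

    // Returns the trimmed text inside the element with the given id,
    // or null if no such element exists.
    // Assumption: the id contains no quote characters.
    static String sectionText(String xml, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@id='" + id + "']", doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse page XML", e);
        }
    }

    public static void main(String[] args) {
        // Markup shaped like scraper.html after its script has run.
        String xml = "<html><body><section id=\"player-controller\">"
                + "<p>inner text</p></section></body></html>";
        System.out.println(sectionText(xml, "player-controller"));
    }
}
```

An empty result for `player-controller` would suggest that the page was read before its script had finished, which is the situation described in the question.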