Saverio Java Saverio Java - 3 months ago 46
Java Question

href field missing when I get the page using jsoup or htmlunit

I'm trying to parse google images search result.

I'm trying to get the href attribute of an element. I've noticed that the href field is missing when I get the page programmatically (this happens with both jsoup and htmlunit).
Comparing the element of the page got programmatically through java and the element of the page loaded by the actual browser, the only difference is, indeed, the href field that is missing (the rest is the same).

The href attribute (IMAGE_LINK) is the following:

/imgres?imgurl=http%3A%2F%2Fcdn.zonarutoppuden.com%2Fns%2Fpe‌​liculas-naruto-shipp‌​uden.jpg&imgrefurl=h‌​ttp%3A%2F%2Fwww.zona‌​rutoppuden.com%2F201‌​0%2F10%2Fnaruto-ship‌​puden-peliculas.html‌​&docid=JR8NPqKrF3ac_‌​M&tbnid=0EPPOYQcflXk‌​MM%3A&w=900&h=600&bi‌​h=638&biw=1275&ved=0‌​ahUKEwih9O2e88_OAhWM‌​ExoKHRLGAGQQMwg2KAMw‌​Aw&iact=mrc&uact=8


Maybe some issue with the javascript engine? Or maybe some kind of algorithm anti-parsing used by the website?

Snippet Java Code:

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.waitForBackgroundJavaScript(50000);
HtmlPage page1=null;

try {
// Get the first page
page1 = webClient.getPage(URL);
System.out.println(page1.asXml());
} catch (FailingHttpStatusCodeException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}


Snippet Html Code (Real Browser):

<a jsaction="fire.ivg_o;mouseover:str.hmov;mouseout:str.hmou" class="rg_l" style="width: 134px; height: 201px; left: 0px; background: rgb(128, 128, 128);" href="IMAGE_LINK"> CONTENT... </a>


Snippet Html Code (Page got programmatically):

<a jsaction="fire.ivg_o;mouseover:str.hmov;mouseout:str.hmou" class="rg_l" style="width: 134px; height: 201px; left: 0px; background: rgb(128, 128, 128);"> CONTENT... </a>


Thank you.

Answer

For each search result there is a <div class="rg_meta">containing a JSON object, which also holds the url. Using a JSON parser like json-simple to parse the object, the following code prints the image urls:

String searchTerm = "naruto shippuden";
String searchUrl = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&biw=1920&bih=955&q=" + searchTerm.replace(" ", "+") + "&gws_rd=cr";

try {
    Document doc = Jsoup.connect(searchUrl)
            .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
            .referrer("https://www.google.com/").get();

    JSONObject obj;

    for (Element result : doc.select("div.rg_meta")) {

        // div.rg_meta contains a JSON object, which also holds the image url
        obj = (JSONObject) new JSONParser().parse(result.text());

        String imageUrl = (String) obj.get("ou");

        // just printing out the url to demonstate the approach
        System.out.println("imageUrl: " + imageUrl);    
    } 

} catch (IOException e1) {
    e1.printStackTrace();
}catch (ParseException e) {
    e.printStackTrace();
}

Output:

imageUrl: http://ib3.huluim.com/show_key_art/1603?size=1600x600&region=US
imageUrl: http://cdn.zonarutoppuden.com/ns/peliculas-naruto-shippuden.jpg
imageUrl: http://www.saiyanisland.com/news/wp-content/uploads2/2014/12/Naruto-Sasuke.jpg
...
Comments