namarino - 1 month ago
Java Question

Java Jsoup Google Image Search result parsing

I'm using jsoup to parse Google image search results. I'm trying to get the src attribute of each image. Here is my code so far. The output is truncated for some reason and I can't access the src attribute. Does anyone know why this is happening and what I can do to fix it? Thanks so much!

public static void main(String args[]) {
    try {
        // Does a google image search for "test"
        final Document doc = Jsoup.connect("https://www.google.com/search?q=test&tbm=isch").userAgent(USER_AGENT).get();

        // selects images
        Elements elements = doc.select("img.rg_ic.rg_i");
        // cycles through elements and prints each one
        for (Element e : elements) {
            System.out.print(e);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}


Output:

<img class="rg_ic rg_i" data-sz="f" name="XWXPqrX1RFJiaM:" alt="Image result for test" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)">

Answer

The following code retrieves the URLs of the first 100 image results with jsoup. If you need all results, you have to use a headless browser (I recommend PhantomJS, see this answer for usage).

The static HTML source stores the image URLs for the first 100 results only inside JSON objects. To parse the scraped JSON objects, I used JSON.simple.

The JSON objects are contained in the <div> elements with class rg_meta and are in the following form:

{"st":"Uber","tu":"https:\/\/encrypted-tbn3.gstatic.com\/images?q=tbn:ANd9GcTSEUMluu1kigjR3JU40BYfaH0fQ6JW1vk9WScBiXr--lsMILf2","ru":"https:\/\/newsroom.uber.com\/uberkittens-are-back\/","tw":300,"pt":"UberKittens Delivers Kittens to Play or Stay","ou":"https:\/\/newsroom.uber.com\/wp-content\/uploads\/2015\/10\/HQ_uberkittens_blog_960x540_r1v1.jpg","ow":960,"cl":6,"isu":"newsroom.uber.com","rid":"vLA3QXY8xPE4PM","cr":3,"ity":"jpg","sc":1,"ct":15,"s":"Clear Your Calendars\u2014#UberKITTENS Are Back","th":168,"oh":540,"id":"qCR7qXt7VX38iM:","itg":false,"cb":15}

So to get the image URL, we need to extract the value of the key "ou".
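To make the extraction step concrete without any third-party dependency, here is a minimal sketch that pulls the "ou" value out of one such JSON string using only the Java standard library. The class OuExtractor and its method extractOu are hypothetical names chosen for illustration, and the sample string is an abbreviated version of the object above. The regular expression is a simplification that only undoes the \/ escape; a real JSON parser such as JSON.simple is the more robust choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup or JSON.simple.
class OuExtractor {

    // Matches "ou":"<value>", where <value> may contain backslash escapes.
    private static final Pattern OU_PATTERN =
            Pattern.compile("\"ou\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the value of the "ou" key, or an empty Optional if the key is absent.
    static Optional<String> extractOu(String json) {
        Matcher m = OU_PATTERN.matcher(json);
        if (!m.find()) {
            return Optional.empty();
        }
        // The JSON escapes "/" as "\/"; undo that one escape only.
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Abbreviated version of the rg_meta JSON shown above.
        String sample = "{\"st\":\"Uber\","
                + "\"ou\":\"https:\\/\\/newsroom.uber.com\\/wp-content\\/uploads\\/2015\\/10\\/HQ_uberkittens_blog_960x540_r1v1.jpg\","
                + "\"ow\":960}";
        System.out.println(extractOu(sample).orElse("(no ou key)"));
    }
}
```

Returning an Optional keeps the "key missing" case explicit, so the caller can skip elements whose JSON has no "ou" entry instead of collecting null values.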

Example Code

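// Imports needed by this snippet (jsoup and JSON.simple must be on the classpath)
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// can only grab first 100 results
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=kittens&gws_rd=cr";

List<String> resultUrls = new ArrayList<String>();

try {
    Document doc = Jsoup.connect(url).userAgent(userAgent).referrer("https://www.google.com/").get();

    // each div.rg_meta holds one JSON object describing a single image result
    Elements elements = doc.select("div.rg_meta");

    JSONParser parser = new JSONParser();
    for (Element element : elements) {
        if (element.childNodeSize() > 0) {
            // text() returns the unescaped text content; childNode(0).toString()
            // would return HTML-escaped markup (for example "&" as "&amp;")
            JSONObject jsonObject = (JSONObject) parser.parse(element.text());
            // "ou" holds the URL of the original image
            resultUrls.add((String) jsonObject.get("ou"));
        }
    }

    System.out.println("number of results: " + resultUrls.size());

    for (String imageUrl : resultUrls) {
        System.out.println(imageUrl);
    }

} catch (IOException | ParseException e) {
    e.printStackTrace();
}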

Output

number of results: 100
https://newsroom.uber.com/wp-content/uploads/2015/10/HQ_uberkittens_blog_960x540_r1v1.jpg
https://pbs.twimg.com/profile_images/562466745340817408/_nIu8KHX.jpeg
http://leecamp.net/wp-content/uploads/kitten-3.jpg 
...
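
The example code hard-codes the query kittens in the search URL. As a small sketch of how the same URL could be built for an arbitrary search term, the following uses java.net.URLEncoder from the standard library to encode the query. The class ImageSearchUrl and its method forQuery are hypothetical names for illustration; the URL parameters are simply copied from the example above and are not checked against any Google documentation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; parameter names are taken from the example URL.
class ImageSearchUrl {

    // Builds the image search URL for the given query, percent-encoding the query text.
    static String forQuery(String query) {
        try {
            return "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&gws_rd=cr";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forQuery("kittens"));
        System.out.println(forQuery("uber kittens & cats"));
    }
}
```

Encoding matters as soon as the query contains spaces or characters such as &, which would otherwise be read as the start of a new URL parameter.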