Radioactive Coffe - 4 months ago
Java Question

How to scrape Google SERPs with Jsoup?

I was trying to scrape links from Google using 600 different searches. In the process, I started getting the following error.

Error

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...




I've done my research: it happens because of the Google Scholar ban, which restricts you to a limited number of searches and then requires you to solve a captcha to proceed, which Jsoup can't do.



Code

Document doc = Jsoup.connect("http://google.com/search?q=" + keyWord)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000)
.get();




Answers on the internet are extremely vague and don't provide a clear solution. Someone did mention that cookies can solve this issue, but didn't say a single thing about how to do it.

Answer

Some hints to improve your scraping:

1. Use proxies

Proxies reduce your chances of getting caught by a captcha. You should use between 50 and 150 proxies, depending on your average result set. Here are two websites that can provide some proxies: SEO-proxies.com or Proxify Switch Proxy.

import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Setup proxy
String proxyAddress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAddress, proxyPort));

// Fetch url with proxy
Document doc = Jsoup //
               .connect(searchUrl) // connect() comes first: it creates the Connection
               .proxy(proxy) //
               .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
               .header("Content-Language", "en-US") //
               .get();
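
With 50 to 150 proxies, you need some way to spread requests across them. Below is a minimal sketch of a round-robin pool, assuming the proxies are given as "host:port" strings. The ProxyPool class and its method names are hypothetical (they are not part of Jsoup), and the addresses are placeholders.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not part of Jsoup): hands out proxies in round-robin order.
final class ProxyPool {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    // Each entry is expected as "host:port", e.g. "1.2.3.4:1234" (placeholder).
    ProxyPool(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            proxies.add(new Proxy(Proxy.Type.HTTP,
                    InetSocketAddress.createUnresolved(host, port)));
        }
    }

    // Returns the next proxy, wrapping around at the end of the list.
    Proxy next() {
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each search would then pass pool.next() to .proxy(...) on the Jsoup connection, so consecutive requests leave from different IPs.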

2. Captchas

If you do get caught by a captcha, you can use an online captcha solving service (Bypass Captcha and DeathByCaptcha, to name two). Below is a generic step-by-step procedure to get the captcha solved automatically:

  • Detect captcha error page


try {

    // Perform search here...

} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
        case java.net.HttpURLConnection.HTTP_UNAVAILABLE:
            // Match only the fixed prefix of the "sorry" page; the rest of the URL varies per search
            if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect")) {
                // Ask online captcha service for help...
            } else {
                // ...
            }
            break;

        default:
            // ...
    }
}
  • Download the captcha image (CI)


byte[] captchaImage = Jsoup   //
    .connect(imageCaptchaUrl) //
    //.cookie(..., ...)       // Some cookies may be needed...
    .ignoreContentType(true)  // Needed for fetching image
    .execute()                //
    .bodyAsBytes();           // byte[] array returned...
  • Send the CI to the online captcha service


This part depends on the captcha service API. You can find some services in this 8 best captcha solving services article.

  • Wait for the response (1-2 seconds is ideal)
  • Fill the form with response and send it with Jsoup

    The Jsoup FormElement is a life saver here. See this working sample code for details.
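
The waiting step can be sketched as a small polling loop. This is only an illustration: the CaptchaPoller class is hypothetical, and the checkAnswer supplier stands in for whatever "is my captcha solved yet?" request your chosen service exposes, which depends on its API.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: polls a captcha service until it returns an answer.
final class CaptchaPoller {

    // Calls checkAnswer up to maxAttempts times, sleeping intervalMillis
    // between attempts. Returns the first answer found, or empty if the
    // attempts run out or the thread is interrupted.
    static Optional<String> waitForAnswer(Supplier<Optional<String>> checkAnswer,
                                          long intervalMillis,
                                          int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<String> answer = checkAnswer.get();
            if (answer.isPresent()) {
                return answer;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

With an interval of 1000 to 2000 ms this matches the 1-2 second wait above; the returned text is what you then put into the captcha form before submitting it.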

3. Some other hints

The Hints for Google scrapers article can give you some more pointers for improving your code. You'll find the first two hints presented here plus some more:

  • Cookies: clear them on each IP change or don't use them at all
  • Threads: You should not open too many connections. Firefox limits itself to 4 connections per proxy.
  • Returned results: append &num=100 to your url to send fewer requests
  • Request rates: Make your requests look human. You should not send more than 500 requests per 24h per IP.
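
The last two hints can be sketched together. The snippet below is a rough illustration, not a tested scraper: the SearchPacing class and its method names are hypothetical. It builds the search URL with an encoded keyword and &num=100, and derives a randomized delay from the 500 requests per 24h per IP budget (24h / 500 = 172.8 seconds on average).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: URL building and per-IP request pacing.
final class SearchPacing {
    static final int MAX_REQUESTS_PER_DAY = 500;
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // Builds the search URL with a URL-encoded keyword and 100 results per page.
    static String searchUrl(String keyword) {
        try {
            return "http://google.com/search?q="
                    + URLEncoder.encode(keyword, "UTF-8") + "&num=100";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Smallest average gap that keeps one IP within the daily budget.
    static long averageDelayMillis() {
        return DAY_MILLIS / MAX_REQUESTS_PER_DAY;
    }

    // Random gap between 1x and 2x the average, so requests are not evenly
    // spaced and a single IP never exceeds the daily budget.
    static long nextDelayMillis() {
        long base = averageDelayMillis();
        return ThreadLocalRandom.current().nextLong(base, 2 * base + 1);
    }
}
```

A scraping loop would call Thread.sleep(SearchPacing.nextDelayMillis()) between two requests that go through the same proxy.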
