Sayakiss - 2 months ago
Java Question

How to tune HttpClient performance when crawling a large number of small files?

I want to crawl some Hacker News stories. Here is my code:

import org.apache.http.client.fluent.Request;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.logging.Logger;
import java.util.stream.IntStream;

public class HackCrawler {
    // Blocking GET that returns the response body as a string.
    private static String getUrlResponse(String url) throws IOException {
        return Request.Get(url).execute().returnContent().asString();
    }

    // Returns the item's JSON if it is a story, otherwise an empty string.
    private static String crawlItem(int id) {
        try {
            String json = getUrlResponse(String.format("https://hacker-news.firebaseio.com/v0/item/%d.json", id));
            if (json.contains("\"type\":\"story\"")) {
                return json;
            }
        } catch (IOException e) {
            System.out.println("crawl " + id + " failed");
        }
        return "";
    }

    public static void main(String[] args) throws FileNotFoundException {
        Logger logger = Logger.getLogger("main");
        // try-with-resources closes the writer, so buffered output is flushed to hack.json
        try (PrintWriter printWriter = new PrintWriter("hack.json")) {
            for (int i = 0; i < 10000; i++) {
                logger.info("batch " + i);
                // Each batch covers 100 consecutive item ids, fetched in parallel.
                IntStream.range(12530671 - (i + 1) * 100, 12530671 - i * 100)
                        .parallel()
                        .mapToObj(HackCrawler::crawlItem)
                        .filter(x -> !x.isEmpty())
                        .forEach(printWriter::println);
            }
        }
    }
}


It currently takes about 3 seconds to crawl 100 items (one batch).

I found that multithreading via parallel() gives a speedup (about 5 times), but I have no idea how to optimise it further.

Could anyone give some suggestions?

Answer

To achieve what Fayaz means, I would use the asynchronous features of the Jetty HTTP client (https://webtide.com/the-new-jetty-9-http-client/).

httpClient.newRequest("http://domain.com/path")
        .send(new Response.CompleteListener()
        {
            @Override
            public void onComplete(Result result)
            {
                // Your logic here
            }
        });

This client uses Java NIO internally to listen for incoming responses, so a few selector threads can watch many connections rather than one thread blocking per connection. It then dispatches the content to worker threads, which are not involved in any blocking I/O.
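If adding Jetty is not an option, the same non-blocking idea can be sketched with the JDK's own java.net.http.HttpClient (Java 11+). This is not the Jetty API, just a rough equivalent; the class name AsyncItemFetcher and its helper methods are made up for illustration, and the URL pattern is the one from the question.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

/** Hypothetical helper: fetches Hacker News items without blocking the caller. */
final class AsyncItemFetcher {
    private static final String ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/%d.json";

    // One shared client, so connections to the host can be reused across requests.
    private final HttpClient client = HttpClient.newHttpClient();

    static URI itemUri(int id) {
        return URI.create(String.format(ITEM_URL, id));
    }

    static boolean isStory(String json) {
        return json.contains("\"type\":\"story\"");
    }

    /** Returns at once; the future completes with the JSON, or "" for non-stories and failures. */
    CompletableFuture<String> fetchStory(int id) {
        HttpRequest request = HttpRequest.newBuilder(itemUri(id)).GET().build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body)
                .thenApply(body -> isStory(body) ? body : "")
                .exceptionally(e -> "");
    }
}
```

Calling fetchStory for every id in a batch and then joining the futures keeps many requests in flight at once, instead of one per thread of the parallel stream.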

You can experiment with the maximum number of connections per destination (a destination is basically a host):

http://download.eclipse.org/jetty/9.3.11.v20160721/apidocs/org/eclipse/jetty/client/HttpClient.html#setMaxConnectionsPerDestination-int-

Since you are heavily loading a single server, this should be quite high.
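To get a feel for what such a limit does, here is a minimal sketch that caps the number of in-flight asynchronous requests with a Semaphore. It only imitates the effect of a per-destination connection limit and is not how Jetty implements it; BoundedCrawler, maxInFlight and the fetcher parameter are hypothetical names, and the fetcher stands in for whatever asynchronous client call you use.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.IntFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical helper: runs async fetches with at most maxInFlight active at a time. */
final class BoundedCrawler {
    private final Semaphore permits;
    private final IntFunction<CompletableFuture<String>> fetcher;

    BoundedCrawler(int maxInFlight, IntFunction<CompletableFuture<String>> fetcher) {
        this.permits = new Semaphore(maxInFlight);
        this.fetcher = fetcher;
    }

    /** Fetches ids in [from, to), waits for all of them, and keeps only the stories. */
    List<String> crawl(int from, int to) {
        // Start every request first; submit() blocks only while the limit is reached.
        List<CompletableFuture<String>> pending = IntStream.range(from, to)
                .mapToObj(this::submit)
                .collect(Collectors.toList());
        return pending.stream()
                .map(CompletableFuture::join)
                .filter(json -> json.contains("\"type\":\"story\""))
                .collect(Collectors.toList());
    }

    private CompletableFuture<String> submit(int id) {
        permits.acquireUninterruptibly(); // wait for a free slot
        CompletableFuture<String> response;
        try {
            response = fetcher.apply(id);
        } catch (RuntimeException e) {
            permits.release(); // the request never started, so free the slot
            return CompletableFuture.completedFuture("");
        }
        // A failed request becomes ""; the slot is freed once the request finishes.
        return response.exceptionally(e -> "")
                .whenComplete((body, error) -> permits.release());
    }
}
```

Raising maxInFlight step by step while timing a batch is a simple way to find the point where the server or the network, rather than the client, becomes the bottleneck.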
