Sayakiss Sayakiss - 1 year ago 129
Java Question

How to tune HttpClient performance when crawling a large number of small files?

I just want to crawl some Hacker News stories. Here is my code:

import org.apache.http.client.fluent.Request;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.logging.Logger;
import java.util.stream.IntStream;

public class HackCrawler {
    // Fetch the response body of a URL as a string.
    private static String getUrlResponse(String url) throws IOException {
        return Request.Get(url).execute().returnContent().asString();
    }

    // Return the item's JSON if it is a story, otherwise an empty string.
    private static String crawlItem(int id) {
        try {
            String json = getUrlResponse(String.format("", id));
            if (json.contains("\"type\":\"story\"")) {
                return json;
            }
        } catch (IOException e) {
            System.out.println("crawl " + id + " failed");
        }
        return "";
    }

    public static void main(String[] args) throws FileNotFoundException {
        Logger logger = Logger.getLogger("main");
        PrintWriter printWriter = new PrintWriter("hack.json");
        for (int i = 0; i < 10000; i++) {
            logger.info("batch " + i);
            IntStream.range(12530671 - (i + 1) * 100, 12530671 - i * 100)
                    .mapToObj(HackCrawler::crawlItem).filter(x -> !x.equals(""))
                    .forEach(printWriter::println);
        }
        printWriter.close();
    }
}

Currently it takes 3 seconds to crawl 100 items (one batch).

I found that using multithreading gives a speedup (about 5 times), but I have no idea how to optimise it further.
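
For reference, a minimal sketch of what a multithreaded batch could look like. It assumes a fixed thread pool, which is not necessarily the mechanism that produced the 5x figure above. `ParallelBatchCrawler`, `crawlBatch` and the `fetcher` parameter are hypothetical names introduced for illustration; `fetcher` stands in for a blocking call such as `crawlItem`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

class ParallelBatchCrawler {
    // Fetches ids in [from, to) on a fixed pool of `threads` workers and
    // returns the non-empty results in id order.
    static List<String> crawlBatch(int from, int to, int threads, IntFunction<String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every id first so the blocking fetches overlap.
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int id = from; id < to; id++) {
                final int itemId = id;
                pending.add(CompletableFuture.supplyAsync(() -> fetcher.apply(itemId), pool));
            }
            // Collect in submission order, dropping empty results.
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : pending) {
                String body = future.join();
                if (!body.isEmpty()) {
                    results.add(body);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With blocking I/O each in-flight request occupies one thread, so the `threads` argument is the main knob in this variant; for example, `crawlBatch(start, start + 100, 20, HackCrawler::crawlItem)` would keep up to 20 requests open at once.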

Could anyone give some suggestions?

Answer

To achieve what Fayaz means, I would use the asynchronous features of the Jetty HTTP client:

httpClient.newRequest(url) // url: address of the item to fetch
        .send(new Response.CompleteListener() {
            public void onComplete(Result result) {
                // Your logic here
            }
        });

This client internally uses Java NIO to listen for incoming responses, so a small number of selector threads can watch many connections instead of one thread blocking per connection. It then dispatches content to worker threads, which are not involved in any blocking I/O operation.
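
The same non-blocking pattern can be sketched without Jetty, using the JDK's own `java.net.http.HttpClient` (Java 11 or later). This is a different client, chosen here only so the example has no external dependencies; `AsyncFetcher`, `fetch` and `fetchAll` are hypothetical names, and the 10-second timeout is an arbitrary assumption.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class AsyncFetcher {
    // One shared client, so connections to the same host can be reused.
    private final HttpClient client = HttpClient.newHttpClient();

    // Starts one non-blocking GET. The future completes with the body,
    // or with "" if the request fails or times out.
    CompletableFuture<String> fetch(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10)) // assumed value, tune as needed
                .GET()
                .build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body)
                .exceptionally(error -> "");
    }

    // Fires all requests up front, then waits for every response.
    List<String> fetchAll(List<String> urls) {
        List<CompletableFuture<String>> pending =
                urls.stream().map(this::fetch).collect(Collectors.toList());
        return pending.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }
}
```

No caller thread waits on an individual response here; only the final `join` calls block, after every request is already in flight.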

You can also experiment with the maximum number of connections per destination (a destination is basically a host).

Since you are heavily loading a single server, this should be quite high.
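
To find out how many simultaneous requests the host tolerates, it can help to cap in-flight requests explicitly while experimenting. The sketch below does this with a plain `Semaphore`, independent of any HTTP library. It is not the Jetty setting itself, and `InFlightLimiter` and `submit` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class InFlightLimiter {
    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks the caller until a slot is free, starts the asynchronous
    // request, and frees the slot once it completes (normally or not).
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> request) {
        permits.acquireUninterruptibly();
        try {
            return request.get().whenComplete((value, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the request never started
            throw e;
        }
    }
}
```

Wrapping each asynchronous call in `limiter.submit(...)` and re-running a batch with different limits shows where throughput stops improving.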