sagar sagar - 3 months ago 38
Python Question

Multithreading in Scrapy using proxies

I want to crawl about 4 million pages using Scrapy. I am using Storm Proxies.
Let's say the number of threads allowed on my account is 20.
My questions:

Does "multithreading" in Scrapy mean CONCURRENT_REQUESTS_PER_DOMAIN,

or is there some other way to do it?

How can I use those 20 threads effectively?

NOTE - If my question is not clear, please leave a comment and I will elaborate.

Answer

Straight out of the docs:

CONCURRENT_REQUESTS - The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader.

CONCURRENT_REQUESTS_PER_DOMAIN - The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single domain.

CONCURRENT_REQUESTS_PER_IP - The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.

Answering your question directly

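I suspect the service only lets you run up to 20 threads overall, meaning it doesn't care which sites you are requesting. So set CONCURRENT_REQUESTS to 20 at most (the default is 16). If all your pages are on a single domain, keep in mind that CONCURRENT_REQUESTS_PER_DOMAIN (default 8) caps you as well, so you would need to raise it too.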
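As a rough sketch, the relevant part of your project's settings.py could look like the following. The value 20 is taken from the thread limit in the question, and setting the per-domain limit to the same value assumes most of the crawl hits one domain; adjust both to your case.

```python
# settings.py (fragment) -- a sketch, assuming the proxy plan allows 20 threads

# Global cap on simultaneous requests made by the Scrapy downloader.
# The default is 16; 20 matches the assumed proxy thread limit.
CONCURRENT_REQUESTS = 20

# Per-domain cap (default 8). If the whole crawl targets a single domain,
# this lower limit would apply, so it is raised to match the global cap.
CONCURRENT_REQUESTS_PER_DOMAIN = 20

# Left at 0 (the default) so the per-domain setting above is the one used.
CONCURRENT_REQUESTS_PER_IP = 0
```

Whichever limit is lowest is the one that actually bounds the crawl, so there is no benefit in pushing CONCURRENT_REQUESTS above what the proxy account allows.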

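Each request is "kind of a thread". Scrapy is built on top of Twisted, so requests run concurrently inside an asynchronous event loop rather than in separate OS threads. In the eyes of the proxy service you are using, there's no way to tell the difference, so every in-flight request will count as one proxy thread!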
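To illustrate the "kind of a thread" point, here is a small standalone sketch. It is not Scrapy or Twisted code: it uses Python's asyncio as a stand-in event loop, and fake_request, crawl and MAX_IN_FLIGHT are made-up names for the example. It shows up to 20 requests in flight at once while everything runs on a single OS thread.

```python
import asyncio
import threading

MAX_IN_FLIGHT = 20  # hypothetical stand-in for CONCURRENT_REQUESTS


async def fake_request(sem, state):
    # The semaphore caps how many "requests" may be in flight at once.
    async with sem:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        state["threads"].add(threading.get_ident())
        await asyncio.sleep(0.01)  # simulated network wait
        state["in_flight"] -= 1


async def crawl(n_requests):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    state = {"in_flight": 0, "peak": 0, "threads": set()}
    await asyncio.gather(*(fake_request(sem, state) for _ in range(n_requests)))
    return state


state = asyncio.run(crawl(100))
print("peak concurrent requests:", state["peak"])
print("OS threads used:", len(state["threads"]))
```

The peak never exceeds the limit, and only one thread is involved. A proxy only sees the simultaneous connections, which is why it would treat each one as a thread.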
