I want to crawl about 4 million pages using Scrapy. I am using Storm Proxies.
Let's say the number of threads allowed on my account is 20.
My questions:
Does "multithreading" correspond to CONCURRENT_REQUESTS_PER_DOMAIN in Scrapy,
or is there some other way to do that?
How can I effectively use those 20 threads?
NOTE - If my question is unclear, please leave a comment and I will try to elaborate.
Straight out of the docs:
CONCURRENT_REQUESTS - The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader.
CONCURRENT_REQUESTS_PER_DOMAIN - The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single domain.
CONCURRENT_REQUESTS_PER_IP - The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.
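To see how these limits combine, here is a rough sketch in plain Python. The helper max_in_flight is hypothetical (it is not part of Scrapy), and it assumes CONCURRENT_REQUESTS_PER_IP is left at 0 so that the per-domain limit is the one in effect. It only gives an upper bound on simultaneous requests, not what a real crawl will actually sustain.

```python
def max_in_flight(concurrent_requests, per_domain, domains):
    """Upper bound on simultaneous requests when limits are applied per domain.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    # The global cap always applies; each domain adds at most `per_domain`.
    return min(concurrent_requests, per_domain * domains)


# Scrapy defaults (16 total, 8 per domain) against a single domain:
print(max_in_flight(16, 8, 1))   # 8
# Raising only the global limit to 20 does not help for one domain:
print(max_in_flight(20, 8, 1))   # 8
# Raising both lets a single-domain crawl reach 20:
print(max_in_flight(20, 20, 1))  # 20
```

So if all 4 million pages are on one site, the per-domain setting is likely the one that actually limits you.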
Answering your question directly
I suspect that the service only lets you use up to 20 threads overall, regardless of what you are requesting, so you should set
CONCURRENT_REQUESTS to at most 20 (the default is 16).
Each request is "kind of a thread". Scrapy is built on top of Twisted, so requests run asynchronously rather than in separate OS threads. From the point of view of the proxy service, however, there is no way to tell the difference, so every concurrent request counts as one proxy thread!
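As a minimal sketch, the relevant part of settings.py could look something like this. The value 20 comes from the thread limit in the question; raising the per-domain limit is my assumption that all pages live on a single domain, so drop that line if you are crawling many sites.

```python
# settings.py (sketch, adjust to your own project)

# Cap the total number of in-flight requests at the proxy plan's thread limit.
# 20 is the figure from the question; the Scrapy default is 16.
CONCURRENT_REQUESTS = 20

# Assumption: every page is on the same domain. The per-domain default is 8,
# which would otherwise keep the crawl below the 20-request cap.
CONCURRENT_REQUESTS_PER_DOMAIN = 20
```

Keeping CONCURRENT_REQUESTS at or below the plan's limit is the important part; the per-domain value only decides how much of that budget a single site may use.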