
Scrapy performance improvements and memory consumption

Server


  • 6 GB RAM

  • 4 Cores Intel Xeon 2.60GHz

  • 32 CONCURRENT_REQUESTS

  • 1m URLs in CSV

  • 700 Mbit/s downstream

  • 96% Memory Consumption



With debug mode on, the scrape stops after around 400,000 URLs, most likely because the server runs out of memory.
Without debug mode it takes up to 5 days, which is pretty slow imo,
and it uses way too much memory (96%).

Any hints are highly welcome :)

import scrapy
import csv


def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        scrapurls = []
        for row in data:
            scrapurls.append("http://"+row[2])
        return scrapurls


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl']=response.url
            item['rssurl']=sel.extract()
            yield item

Answer

As I commented, you should use generators to avoid creating lists of objects in memory (see what-does-the-yield-keyword-do-in-python). With a generator, objects are created lazily, so you never hold a large list of objects in memory all at once:

def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        for row in data:
            yield "http://" + row[2]  # yield each url lazily


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        # return a generator expression.
        return (scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv())

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl']=response.url
            item['rssurl']=sel.extract()
            yield item
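
As a quick sanity check outside of Scrapy (a hypothetical snippet, assuming a data.csv with a URL in the third column of each row), you can see that the rewritten get_urls_from_csv really does read the file lazily:

urls = get_urls_from_csv()  # no rows are read yet, the function body has not run
first_url = next(urls)      # opens the file and reads only the first row
print(first_url)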

As far as performance goes, the docs on Broad Crawls suggest increasing concurrency:

Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit. The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bound. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.

To increase the global concurrency use:

CONCURRENT_REQUESTS = 100

emphasis mine.
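
The quoted passage also mentions a per-domain limit. With a million different domains it is unlikely to be your bottleneck, but for completeness the corresponding setting looks like this (the value here is only an illustration; Scrapy's default is 8):

CONCURRENT_REQUESTS_PER_DOMAIN = 16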

Also, increase the Twisted IO thread pool maximum size:

Currently Scrapy does DNS resolution in a blocking way with the usage of a thread pool. With higher concurrency levels the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will be processed faster, speeding up connection establishment and the crawl overall.

To increase maximum thread pool size use:

REACTOR_THREADPOOL_MAXSIZE = 20
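
Putting the two together in settings.py (a minimal sketch; the numbers are just the starting points from the docs and should be tuned with the trial-and-error approach described above; LOG_LEVEL is an extra recommendation from the same Broad Crawls page, which should also help with the memory pressure you are seeing with debug logging on 1m URLs):

# settings.py - starting points for a broad crawl, tune against CPU usage
CONCURRENT_REQUESTS = 100          # global concurrency limit
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for blocking DNS resolution
LOG_LEVEL = 'INFO'                 # avoid the overhead of DEBUG logging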