
Scrapy Python - How to Pass URL and retrieve URL for Scraping

I have very little programming experience with Python; most of my experience is with Java.

I am trying to get into Python and am having trouble understanding a Scrapy web crawler I am trying to set up.

The script will scrape products and similar data from the site into a file, recursively following all links within the site but stopping at a specified depth.
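
For the depth cutoff, Scrapy's built-in DEPTH_LIMIT setting looks like it should cover that part; this is a minimal sketch of the shape I have in mind, not working code (the site URL and spider name are placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ProductSpider(CrawlSpider):
    name = "products"
    start_urls = ['http://quotes.toscrape.com/']  # placeholder start page
    custom_settings = {'DEPTH_LIMIT': 3}  # stop recursing below depth 3

    # Follow every link on each page and hand the page to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        self.log('Visited %s' % response.url)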

I'm having trouble understanding how to pass a URL from the script that runs the crawl to an example Scrapy spider I found.

Code that executes my spider:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    # Browser-like user agent for the crawl
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(UrlScrappyRunner, domain="www.google.com")
process.start()  # blocks until the crawl finishes


My Spider:

import scrapy

class UrlScrappyRunner(scrapy.Spider):

    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save each page's raw HTML to a local file named after the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)


Please can you let me know how to pass domain="www.google.com" to my spider so that it crawls Google rather than quotes.toscrape.com?

Answer

You can use the -a argument in Scrapy to pass user-defined values to the spider:

class UrlScrappyRunner(scrapy.Spider):
    name = "quotes"

    def __init__(self, domain=None, *args, **kwargs):
        super(UrlScrappyRunner, self).__init__(*args, **kwargs)
        self.domain = domain

    def start_requests(self):
        # self.domain holds whatever was passed via -a or process.crawl()
        yield scrapy.Request(url='http://%s/' % self.domain, callback=self.parse)

To run from the command line, pass the argument with -a and reference the spider by its name attribute ("quotes"), not the class name:

scrapy crawl quotes -a domain="www.google.com"

To run from a script with CrawlerProcess:

process.crawl(UrlScrappyRunner, domain="www.google.com")

Add an __init__ method to your spider and assign the domain value to an instance variable; a complete runnable sketch follows.
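
Putting it together, here is a minimal end-to-end sketch that combines your runner code with the spider above (the http:// prefix is an assumption, since the domain is passed without a scheme, and super().__init__() is called so Scrapy's normal spider setup still happens):

import scrapy
from scrapy.crawler import CrawlerProcess

class UrlScrappyRunner(scrapy.Spider):
    name = "quotes"

    def __init__(self, domain=None, *args, **kwargs):
        super(UrlScrappyRunner, self).__init__(*args, **kwargs)
        self.domain = domain  # value of -a domain=... or process.crawl(..., domain=...)

    def start_requests(self):
        # Build the first request from the domain that was passed in
        yield scrapy.Request(url='http://%s/' % self.domain, callback=self.parse)

    def parse(self, response):
        self.log('Crawled %s' % response.url)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(UrlScrappyRunner, domain="www.google.com")
process.start()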
