Cristian Calin - 4 months ago
JSON Question

Scrapy returns more results than expected

This is a continuation of the question: Extract from dynamic JSON response with Scrapy

I have a Scrapy spider that extracts values from a JSON response. It works well and extracts the right values, but it somehow enters a loop and returns more results than expected (duplicate results).

For example, for 17 values provided in the test.txt file, it returns 289 results, which is 17 times more than expected.

Spider content below:

import scrapy
import json
from whois.items import WhoisItem

class whoislistSpider(scrapy.Spider):
    name = "whois_list"
    start_urls = []
    f = open('test.txt', 'r')
    global lines
    lines = f.read().splitlines()
    f.close()
    def __init__(self):
        for line in lines:
            self.start_urls.append('http://www.example.com/api/domain/check/%s/com' % line)

    def parse(self, response):
        for line in lines:
            jsonresponse = json.loads(response.body_as_unicode())
            item = WhoisItem()
            domain_name = list(jsonresponse['domains'].keys())[0]
            item["avail"] = jsonresponse["domains"][domain_name]["avail"]
            item["domain"] = domain_name
            yield item


items.py content below

import scrapy

class WhoisItem(scrapy.Item):
    avail = scrapy.Field()
    domain = scrapy.Field()


pipelines.py below

class WhoisPipeline(object):
    def process_item(self, item, spider):
        return item


Thank you in advance for all the replies.

Answer

The parse function should look like this:

def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    item = WhoisItem()
    domain_name = list(jsonresponse['domains'].keys())[0]
    item["avail"] = jsonresponse["domains"][domain_name]["avail"]
    item["domain"] = domain_name
    yield item

Notice that I removed the for loop.

What was happening: Scrapy calls parse once for each of the 17 responses, and inside parse you looped over all 17 lines, yielding the same item 17 times per response. That gives 17 * 17 = 289 records.
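
To see the arithmetic without running a crawl, here is a minimal sketch that imitates the two versions of parse in plain Python. It does not use Scrapy: the names fake_body, parse_buggy and parse_fixed, the domain0..domain16 values, and the plain dicts standing in for WhoisItem are all assumptions made for illustration. Only the JSON shape (a "domains" mapping with an "avail" field) is taken from the question.

```python
import json

# Hypothetical stand-ins for the 17 lines of test.txt.
lines = ["domain%d" % i for i in range(17)]

def fake_body(name):
    # Assumed response shape, mirroring what the spider reads:
    # {"domains": {"<name>.com": {"avail": ...}}}
    return json.dumps({"domains": {name + ".com": {"avail": True}}})

def parse_buggy(body):
    # Original behaviour: the same response is parsed once per line in the file.
    for _ in lines:
        data = json.loads(body)
        domain_name = list(data["domains"].keys())[0]
        yield {"domain": domain_name, "avail": data["domains"][domain_name]["avail"]}

def parse_fixed(body):
    # Fixed behaviour: exactly one item per response.
    data = json.loads(body)
    domain_name = list(data["domains"].keys())[0]
    yield {"domain": domain_name, "avail": data["domains"][domain_name]["avail"]}

# Scrapy calls parse once per response; here there is one fake response per line.
bodies = [fake_body(name) for name in lines]
buggy_items = [item for body in bodies for item in parse_buggy(body)]
fixed_items = [item for body in bodies for item in parse_fixed(body)]

print(len(buggy_items))  # 289
print(len(fixed_items))  # 17
```

The buggy version still yields only 17 distinct domains. Each one is simply repeated 17 times, which matches the duplicates described in the question.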