Lewis Lewis - 1 year ago 62
JSON Question

Scrapy only outputs '['

I am building a web crawler with Scrapy that collects all the Reddit links on the front page. When I try to export the results to a JSON file, all I get is '['.

Here is my spider.

    from scrapy import Spider
    from scrapy.selector import Selector
    from redditScrape.items import RedditscrapeItem

    class RedditSpider(Spider):
        name = "redditScrape"
        allowed_domains = ["reddit.com"]
        start_urls = [

        def parse(self, response):
            titles = Selector(response).xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]')

            for title in titles:
                item = RedditscrapeItem()
                item['title'] = title.xpath('/a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract()
                yield item

Whenever I run the XPath query in my Google Chrome console, I get the result I'm looking for.

Any idea why my scraper won't output correctly?

This is the command I am using to execute:

    scrapy crawl redditScrape -o items.json -t json

Answer Source

I don't know exactly what the problem is, but here is what I see wrong in your code.

  • First off, the -t argument sets the output format, and I suspect you used it to make sure the output was a JSON file. You don't need to: Scrapy infers the format from the file extension, so -o items.json is enough. scrapy crawl redditScrape -o items.json

  • You don't need to create a Selector yourself; you can call xpath on the response directly: titles = response.xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]'). This is not an error so much as a quality-of-life improvement.

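  • The second XPath is shady, to say the least. The leading / makes it an absolute path from the document root, so it cannot match an a element inside title; drop the slash to make it relative. Using extract_first() also gives you a single string instead of a list: item['title'] = title.xpath('a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract_first(). Note that @class="..." compares the whole attribute string, and class names such as loggedin or lcTagged may be added only in your browser, so the page Scrapy downloads might not contain them.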
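
To illustrate why a query that works in the browser console can match nothing in a crawler, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup in SAMPLE is hypothetical: it assumes the downloaded page lacks browser-side classes such as lcTagged, which may or may not be the case for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<html><body>
  <div class="entry unvoted">
    <p class="title"><a class="title may-blank" href="/r/a/1">First post</a></p>
  </div>
  <div class="entry unvoted">
    <p class="title"><a class="title may-blank" href="/r/b/2">Second post</a></p>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Predicate copied from the question: the whole class string must be equal,
# so a single missing class name (here "lcTagged") means no match at all.
strict = root.findall(".//div[@class='entry unvoted lcTagged']/p[@class='title']")

# Shorter predicate: only the <p class="title"> wrapper is required.
loose = root.findall(".//p[@class='title']/a")

strict_count = len(strict)
loose_titles = [a.text for a in loose]
print(strict_count, loose_titles)
```

If the strict query finds nothing, the loop in parse never runs and no item is yielded, which is why a predicate that depends on fewer class names tends to be more robust.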

Whenever an item is successfully yielded, Scrapy adds it to the output file at runtime, so a file holding only '[' suggests that no items were yielded at all.
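
The idea of writing items to a JSON list as they arrive can be sketched as follows. StreamingJsonWriter and export are hypothetical names made up for this illustration; this is not Scrapy's exporter, only a rough model of incremental output under the assumption that the opening bracket is written before any item.

```python
import io
import json


class StreamingJsonWriter:
    """Hypothetical stand-in for a feed exporter; not Scrapy's own class."""

    def __init__(self, stream):
        self.stream = stream
        self.first = True

    def start(self):
        # The opening bracket is written before any item arrives.
        self.stream.write("[")

    def write_item(self, item):
        # Separate items with a comma; the first one only needs a newline.
        self.stream.write("\n" if self.first else ",\n")
        self.stream.write(json.dumps(item))
        self.first = False

    def finish(self):
        # Close the list once the crawl is over.
        self.stream.write("\n]")


def export(items):
    """Write every item to an in-memory stream and return the text."""
    stream = io.StringIO()
    writer = StreamingJsonWriter(stream)
    writer.start()
    for item in items:
        writer.write_item(item)
    writer.finish()
    return stream.getvalue()


empty_output = export([])
full_output = export([{"title": "First post"}, {"title": "Second post"}])
print(repr(empty_output))
print(full_output)
```

With no items, the output in this model is little more than the brackets, which matches the symptom in the question: the export itself works, but the spider never yields anything to write.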


You can simply use the XPath //p[@class="title"]/a/text() to get all the titles from the front page. In your code it will look something like this:

    for title in response.xpath('//p[@class="title"]/a'):
        item = RedditscrapeItem()
        item['title'] = title.xpath('text()').extract_first()
        yield item