I am building a web crawler using Scrapy that collects all the Reddit links on the front page. When I try to export them to a JSON file, all I get is '['.
Here is my spider.
from scrapy import Spider
from scrapy.selector import Selector
from redditScrape.items import RedditscrapeItem
name = "redditScrape"
allowed_domains = ["reddit.com"]
start_urls = [
def parse(self, response):
    titles = Selector(response).xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]')
    for title in titles:
        item = RedditscrapeItem()
        item['title'] = title.xpath('/a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract()
scrapy crawl redditScrape -o items.json -t json
I don't know exactly what the problem is but I will have a go at what I see wrong in your code.
First off, I don't know what the -t argument is, but I suspect you wanted to make sure that the output was a JSON file. You don't need to: -o items.json is enough, since Scrapy infers the export format from the file extension.
scrapy crawl redditScrape -o items.json
You don't need to create a Selector yourself; you can just as well call xpath on the response directly:
titles = response.xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]')
This is not an error so much as a quality-of-life improvement.
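The second XPath is shady, to say the least. The leading / makes it an absolute path from the document root, so it matches nothing when called on a title. Drop the slash so the path is relative to each title, and use extract_first() to get a single string instead of a list: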
item['title'] = title.xpath('a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract_first()
Whenever an item is successfully yielded, Scrapy adds it to the output file at runtime, so parse must yield each item it builds.
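To see why a missing yield leaves the export empty, here is a minimal stand-alone sketch. It does not use Scrapy: the two parse_* functions, the collect helper, and the sample titles are all hypothetical stand-ins for a spider callback and for a framework that gathers whatever the callback returns.

```python
import json


def parse_without_yield(titles):
    """Builds items but never hands them back, so the caller receives None."""
    for title in titles:
        item = {"title": title}  # created, then discarded


def parse_with_yield(titles):
    """Generator version: every item is handed back to the caller."""
    for title in titles:
        item = {"title": title}
        yield item


def collect(callback, titles):
    """Hypothetical stand-in for the framework: gather what the callback produces."""
    result = callback(titles)
    return list(result) if result is not None else []


sample_titles = ["First post", "Second post"]  # made-up data
empty = collect(parse_without_yield, sample_titles)
filled = collect(parse_with_yield, sample_titles)

print(json.dumps(empty))   # nothing was yielded, so nothing to export
print(json.dumps(filled))  # one entry per yielded item
```

The only difference between the two callbacks is the final yield, which is what decides whether the exporter has anything to write.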
You can simply use the XPath //p[@class="title"]/a/text() to get all the titles from the front page. In your code it will look something like this:
for title in response.xpath('//p[@class="title"]/a'):
    item = RedditscrapeItem()
    item['title'] = title.xpath('text()').extract_first()
    yield item
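The simplified expression can be tried offline without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and runs it on a made-up, well-formed snippet that loosely imitates the front-page markup. The snippet, its class values, and its link targets are assumptions, not Reddit's real HTML.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page; the real markup will differ.
SAMPLE = """
<html><body>
  <div class="entry unvoted">
    <p class="title"><a class="title may-blank" href="/r/a/1">First post</a></p>
  </div>
  <div class="entry unvoted">
    <p class="title"><a class="title may-blank" href="/r/b/2">Second post</a></p>
  </div>
  <p class="tagline">not a title</p>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are evaluated relative to an element, hence the leading "."
titles = [a.text for a in root.findall(".//p[@class='title']/a")]
print(titles)
```

Matching on the p element alone keeps the expression independent of the extra classes on the link, which can change with login state or browser extensions.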