mangoHero1 - 1 year ago
JSON Question

IndexError when using Scrapy for absolute links

I am scraping a Wikipedia page (particularly this one) using a Python library called Scrapy. Here is the original code, which successfully crawled the page:

import scrapy
from wikipedia.items import WikipediaItem

class MySpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = [""]
    start_urls = [
        # start URL missing from the original post
    ]

    def parse(self, response):
        # Each <li> under the "mw-pages" div is one page entry.
        titles = response.xpath('//div[@id="mw-pages"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["title"] = title.xpath("a/text()").extract()
            item["url"] = title.xpath("a/@href").extract()
            items.append(item)  # collect the item so it is actually returned
        return items

Then in the terminal, I ran
scrapy crawl wiki -o wiki.json -t json
to output the data to a JSON file. While the code worked, the links assigned to the "url" keys were all relative links, for example:
{"url": ["/wiki/9_Full_Moons"], "title": ["9 Full Moons"]}

Instead of /wiki/9_Full_Moons, I needed the absolute URL. So I modified the code above to import urljoin from the urlparse module. I also modified my loop to look like this instead:

for title in titles:
    item = WikipediaItem()
    url = title.xpath("a/@href").extract()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = urljoin("", url[0])

I believed this was the correct approach since the data assigned to the "url" key is enclosed in brackets (which would entail a list, right?), so to get the string inside it, I typed url[0]. However, this time I got an IndexError that looked like this:

IndexError: list index out of range

Can someone help explain where I went wrong?

Answer Source

After modeling the code on the example given in the documentation here, I was able to get it to work:

def parse(self, response):
    for text in response.xpath('//div[@id="mw-pages"]//li/a/text()').extract():
        yield WikipediaItem(title=text)
    for href in response.xpath('//div[@id="mw-pages"]//li/a/@href').extract():
        link = urljoin("", href)
        yield WikipediaItem(url=link)

If anyone needs further clarification on how the Items class works, the documentation is here.
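As for the original IndexError: url[0] only fails when extract() returned an empty list, which would happen if at least one <li> under the "mw-pages" div has no direct <a> child (that is my guess at the cause, not something confirmed from the page). Here is a minimal sketch of guarding against that before joining. The helper name to_absolute and the BASE_URL value are placeholders I made up for illustration, not the URL from the question.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

BASE_URL = "https://example.org"  # hypothetical placeholder base URL


def to_absolute(hrefs, base=BASE_URL):
    """Join the first extracted href onto base.

    hrefs is a list such as extract() returns; it is empty when the
    XPath matched nothing, so indexing it blindly raises IndexError.
    """
    if not hrefs:
        return None
    return urljoin(base, hrefs[0])
```

Calling to_absolute([]) returns None instead of raising, so the loop can skip that entry. If your Scrapy version provides response.urljoin(href), that should do the same join against the page's own URL.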

Furthermore, although the code works, it won't pair each title with its respective link. It yields every title as its own item and then every link as its own item, instead of one item per page holding both the title and its URL (the latter probably being the more desired result), but that's for another question. If anyone has a proposed solution that works better than mine, I'll be more than happy to listen to your answers! Thanks.
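One possible way to keep each title with its link is to go back to iterating over the <li> nodes and build a single item per node. The sketch below shows just that pairing logic in plain Python so it can be run without Scrapy. The function name pair_rows and the example base URL are assumptions of mine, and plain dicts stand in for WikipediaItem.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def pair_rows(rows, base):
    """Build one {"title", "url"} dict per row.

    rows: iterable of (titles, hrefs) pairs, each a list such as
    extract() returns for a single <li>. Rows missing either value
    are skipped rather than indexed.
    """
    items = []
    for titles, hrefs in rows:
        if not titles or not hrefs:
            continue  # e.g. an <li> with no direct <a> child
        items.append({"title": titles[0], "url": urljoin(base, hrefs[0])})
    return items
```

In the spider, each row would come from one title selector in the original loop, i.e. (title.xpath("a/text()").extract(), title.xpath("a/@href").extract()), and the dict could be swapped for WikipediaItem(title=..., url=...).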
