mangoHero1 - 1 month ago

IndexError when using Scrapy for absolute links

I am scraping a webpage from Wikipedia (particularly this one) using a Python library called Scrapy. Here is the original code, which successfully crawled the page:

import scrapy
from wikipedia.items import WikipediaItem


class MySpider(scrapy.Spider):
    name = "wiki"
    # allowed_domains should hold bare domain names, without a trailing slash
    allowed_domains = ["en.wikipedia.org"]
    start_urls = [
        'https://en.wikipedia.org/wiki/Category:2013_films',
    ]

    def parse(self, response):
        # Each <li> under the category listing holds one film link
        titles = response.xpath('//div[@id="mw-pages"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["title"] = title.xpath("a/text()").extract()
            item["url"] = title.xpath("a/@href").extract()
            items.append(item)
        return items


Then in the terminal I ran

scrapy crawl wiki -o wiki.json -t json

to output the data to a JSON file. While the code worked, the links assigned to the "url" keys were all relative links, e.g.:

{"url": ["/wiki/9_Full_Moons"], "title": ["9 Full Moons"]}

Instead of /wiki/9_Full_Moons, I needed http://en.wikipedia.org/wiki/9_Full_Moons. So I modified the code above to import urljoin from the urlparse module, and changed my for loop to look like this instead:

for title in titles:
    item = WikipediaItem()
    url = title.xpath("a/@href").extract()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = urljoin("http://en.wikipedia.org", url[0])
    items.append(item)
return items


I believed this was the correct approach, since the value assigned to the url key is enclosed in brackets (which would entail a list, right?), so to get the string inside it I wrote url[0]. However, this time I got an IndexError:


IndexError: list index out of range


Can someone help explain where I went wrong?

Answer

After modelling the code on the example given in the documentation here, I was able to get it to work:

from urlparse import urljoin  # on Python 3: from urllib.parse import urljoin

def parse(self, response):
    for text in response.xpath('//div[@id="mw-pages"]//li/a/text()').extract():
        yield WikipediaItem(title=text)
    for href in response.xpath('//div[@id="mw-pages"]//li/a/@href').extract():
        link = urljoin("http://en.wikipedia.org", href)
        yield WikipediaItem(url=link)

If anyone needs further clarification on how the Items class works, the documentation is here.
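For context, the WikipediaItem referenced above would be declared along these lines in items.py (a sketch; the actual wikipedia project's items.py may differ):

```python
import scrapy


class WikipediaItem(scrapy.Item):
    # Each Field() declares a key the item accepts;
    # assigning to an undeclared key raises KeyError.
    title = scrapy.Field()
    url = scrapy.Field()
```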

One caveat: although the code works, it won't pair each title with its respective link. So it will give you

TITLE, TITLE, TITLE, LINK, LINK, LINK

instead of

TITLE, LINK, TITLE, LINK, TITLE, LINK

(the latter probably being the more desirable result), but that's for another question. If anyone has a proposed solution that works better than mine, I'll be more than happy to hear it. Thanks.
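For what it's worth, one way to get the TITLE, LINK pairing is to zip the two extracted lists, so each title is emitted together with its own link. A sketch, where pair_titles_and_urls is a hypothetical helper (written for Python 3's urllib.parse; on Python 2, urljoin lives in urlparse):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

BASE = "http://en.wikipedia.org"

def pair_titles_and_urls(titles, hrefs, base=BASE):
    """Pair each title with its absolute URL (assumes parallel lists)."""
    return [{"title": title, "url": urljoin(base, href)}
            for title, href in zip(titles, hrefs)]

# The same idea inside the spider, extracting both fields from each <li>
# node so they can never get out of step:
#
# def parse(self, response):
#     for li in response.xpath('//div[@id="mw-pages"]//li'):
#         title = li.xpath("a/text()").extract()
#         href = li.xpath("a/@href").extract()
#         if title and href:  # skip any <li> without a link
#             yield WikipediaItem(title=title[0],
#                                 url=urljoin(BASE, href[0]))
```

Note that zip stops at the shorter list, so a missing link would silently drop its title; extracting both fields per node, as in the commented spider version, is the more robust variant.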