RealMonia - 2 months ago
Python Question

How to modify my code to scrape these links?

I am new to Python Scrapy, and my Scrapy version is 1.1.3. I want to get a list of the links in this part of https://www.wikipedia.org/. How should I modify my code?

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = [
        'https://www.wikipedia.org/',
    ]

    def parse(self, response):
        for link in response.xpath('//div/ul/li/a'):
            yield {
                'link': link.extract()
            }


The code above is in my project folder, at spiders/spiders.py.

What I get is

[
{"link": "<a href=\"//de.wikipedia.org/\" lang=\"de\">Deutsch</a>"},
{"link": "<a href=\"//en.wikipedia.org/\" lang=\"en\" title=\"English\">English</a>"},
{"link": "<a href=\"//es.wikipedia.org/\" lang=\"es\">Espa\u00f1ol</a>"},
{"link": "<a href=\"//fr.wikipedia.org/\" lang=\"fr\">Fran\u00e7ais</a>"},
{"link": "<a href=\"//it.wikipedia.org/\" lang=\"it\">Italiano</a>"},
{"link": "<a href=\"//nl.wikipedia.org/\" lang=\"nl\">Nederlands</a>"},
{"link": "<a href=\"//ja.wikipedia.org/\" lang=\"ja\" title=\"Nihongo\">\u65e5\u672c\u8a9e</a>"},
{"link": "<a href=\"//pl.wikipedia.org/\" lang=\"pl\">Polski</a>"},
{"link": "<a href=\"//ru.wikipedia.org/\" lang=\"ru\" title=\"Russkiy\">\u0420\u0443\u0441\u0441\u043a\u0438\u0439</a>"},
{"link": "<a href=\"//ceb.wikipedia.org/\" lang=\"ceb\">Sinugboanong Binisaya</a>"}
]


What I expect is a list that contains only the links, such as "//de.wikipedia.org/".

Answer

You need to modify the XPath query so that it selects the value of the href attribute rather than the whole <a> tag:

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = [
        'https://www.wikipedia.org/',
    ]

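    def parse(self, response):
        # Appending /@href selects each link's href attribute instead of the <a> element
        for link in response.xpath('//div/ul/li/a/@href'):
            yield {
                # extract() returns the attribute value as a string, e.g. "//de.wikipedia.org/"
                'link': link.extract()
            }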
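
The hrefs extracted this way are protocol-relative ("//de.wikipedia.org/"), which is exactly what the question asks for. If absolute URLs are needed later, one possible approach is to resolve each href against the page URL. The sketch below is only an illustration under assumptions: it uses Python 3's standard urllib.parse, and the names BASE_URL and to_absolute are hypothetical helpers, not part of Scrapy or of the original code.

```python
from urllib.parse import urljoin

# Assumed base: the page the spider started from.
BASE_URL = "https://www.wikipedia.org/"


def to_absolute(hrefs, base_url=BASE_URL):
    """Resolve each (possibly protocol-relative) href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    # Sample values copied from the output shown in the question.
    sample = ["//de.wikipedia.org/", "//en.wikipedia.org/"]
    for url in to_absolute(sample):
        print(url)
```

Inside a spider, calling response.urljoin(link.extract()) should give the same result, since it resolves the href against the URL of the response.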