Coding_Rabbit - 5 months ago
HTML Question

How to get full link text with Scrapy

I'm using Scrapy to extract data from a web page, and I've run into the following problem.

<a href="NEW-IMAGE?type=GENE&amp;object=EG10567"><b>man</b> X - <i>Escherichia coli</i></a>

On the web page, the record's name looks like this:
[screenshot of the record name as rendered in the browser]

I want to get the full text content of the <a> tag (e.g. man X - Escherichia coli), without the nested tags themselves. Here is my code:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]')
    base_url = ""
    for site in sites:
        item = MetaCyc()
        name_tmp = map(unicode.strip, site.xpath('text()').extract())
        item['Name'] = unicode(name_tmp).encode('utf-8')
        item['Link'] = map(unicode.strip, site.xpath('@href').extract())
        yield item

I have tried converting the unicode to UTF-8, but the result still looks like this:

{"Link": ["NEW-IMAGE?type=GENE&object=EG10567"], "Name": "[u'X -']"}

Sometimes characters are missing from the records.
So I want to know how to get the complete, correctly formatted text from the HTML.


I suggest you use XPath's normalize-space() function. From the XPath 1.0 specification:

The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are the same as those allowed by the S production in XML. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
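To make that definition concrete, here is a rough pure-Python approximation of what normalize-space() does to a string. The helper name normalize_space is made up for this illustration and is not part of Scrapy or any XPath library; it is written for Python 3 and only mimics the whitespace rules quoted above.

```python
import re

# XML whitespace (the S production): space, tab, carriage return, line feed.
_XML_WHITESPACE_RUN = re.compile("[ \t\r\n]+")


def normalize_space(text):
    """Approximate XPath normalize-space() for a plain Python string.

    Hypothetical helper for illustration only.
    """
    # Replace every run of XML whitespace with a single space...
    collapsed = _XML_WHITESPACE_RUN.sub(" ", text)
    # ...then drop the (at most one) space left at each end.
    return collapsed.strip(" ")


raw = "\n man\n X -\n   Escherichia coli \n"
print(repr(normalize_space(raw)))
```

Note that this differs slightly from " ".join(text.split()), because str.split() also treats characters such as the non-breaking space as whitespace, while XPath does not.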

>>> html = """<li>
... <a href="NEW-IMAGE?type=GENE&amp;object=EG10567">
... <b>
... man
... </b>
... X -
... <i>
... Escherichia coli
... </i>
... </a>
... <br>
... </li>"""
>>> import scrapy
>>> selector = scrapy.Selector(text=html)

>>> links = selector.xpath('//li/a[contains(@href,"NEW-IMAGE")]')
>>> for link in links:
...     item = {}
...     item['Name'] = link.xpath('normalize-space(.)').extract_first()
...     item['Link'] = link.xpath('@href').extract_first()
...     print(item)
{'Link': u'NEW-IMAGE?type=GENE&object=EG10567', 'Name': u'man X - Escherichia coli'}
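As background on why the original query returned only 'X -': the XPath step text() selects just the text nodes that are direct children of the <a> element, so the text inside nested elements such as <b> and <i> is skipped, whereas normalize-space(.) works on the string-value of the whole element. The sketch below shows the same distinction using only the standard library's xml.etree.ElementTree instead of Scrapy. The fragment is an assumed, simplified, well-formed version of the markup in the question, the helper names are hypothetical, and the code targets Python 3.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified fragment modelled on the markup in the question.
FRAGMENT = (
    '<li><a href="NEW-IMAGE?type=GENE&amp;object=EG10567">'
    "<b>man</b> X - <i>Escherichia coli</i></a></li>"
)


def direct_text(element):
    """Non-empty text that sits directly under element (what text() selects)."""
    # element.text is the text before the first child; each child's .tail is
    # the text that follows that child inside the parent.
    pieces = [element.text] + [child.tail for child in element]
    return [p.strip() for p in pieces if p and p.strip()]


def full_text(element):
    """All descendant text, whitespace-collapsed (roughly normalize-space(.))."""
    return " ".join("".join(element.itertext()).split())


link = ET.fromstring(FRAGMENT).find("a")
print(direct_text(link))   # ['X -']
print(full_text(link))     # man X - Escherichia coli
print(link.get("href"))    # NEW-IMAGE?type=GENE&object=EG10567
```

In a Scrapy spider the equivalent change is the one shown above: query normalize-space(.) on the link selector instead of text().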