Parab00n - 1 year ago
Python Question

Removing HTML tags without /text().extract()

To start, I'm very new at all this so get ready for some jacked up code from me copying/pasting from all kinds of sources.

I want to strip all the HTML markup from what Scrapy returns. Everything is storing in MySQL with no issues, but I can't get rid of the '<td>' and other HTML tags. I initially just used /text().extract(), but every so often it comes across a cell with a nested tag, like the first one here:

<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>

There's no pattern that would let me choose when to use /text() and when not to, so I'm looking for the easiest way a beginner can implement to strip all of that off.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
import html2text
from scraper.items import LivingSocialDeal

class CFBDVRB(BaseSpider):
    name = "cfbdvrb"
    allowed_domains = ["url"]
    start_urls = [
        # URL omitted in the original post
    ]

    deals_list_xpath = '//table[@class="tbl data-table"]/tbody/tr'
    item_fields = {
        'title': './/td[1]',
        'link': './/td[2]',
        'location': './/td[3]',
        'original_price': './/td[4]',
        'price': './/td[5]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        for deal in selector.xpath(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)

            converter = html2text.HTML2Text()
            converter.ignore_links = True
            yield loader.load_item()

The converter = html2text.HTML2Text() lines were my last attempt at removing the tags. I'm not entirely sure I implemented it correctly, but it didn't work.

Thanks in advance for any help you would like to give and I also apologize if I'm missing something easy that a quick search could pull up.

Answer Source

The authors of Scrapy provide a lot of this functionality in w3lib, a library that is installed along with Scrapy.

Based on your code, you're using a pretty dated version of Scrapy (pre 0.22). I'm not sure exactly what's available to you, so you may need to import from scrapy.utils.markup instead of w3lib.html.

If you have the variable my_text that has your HTML text in it, do the following:

>>> from w3lib.html import remove_tags
>>> my_text
'<td>    <span class="caps">TEXT</span>  </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>'
>>> remove_tags(my_text)
u'    TEXT  \n    Text    \n    Text    \n    Text    \n    Text    '
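
The result above still carries the padding and newlines that sat between the tags. Here is a minimal sketch of one way to tidy that up, assuming you want one trimmed value per cell. The helper name normalize_cells is hypothetical (it is not part of w3lib or Scrapy), and it uses only built-in string methods:

```python
# Output of remove_tags(my_text) from the session above, copied verbatim.
stripped = u'    TEXT  \n    Text    \n    Text    \n    Text    \n    Text    '


def normalize_cells(text):
    """Split on newlines, trim each piece, and drop pieces that end up empty.

    Hypothetical helper for illustration; not part of w3lib or Scrapy.
    """
    return [part.strip() for part in text.splitlines() if part.strip()]


cells = normalize_cells(stripped)
print(cells)  # ['TEXT', 'Text', 'Text', 'Text', 'Text']
```

If you process one cell at a time, as the item loader in the question does, a plain .strip() on each value is enough.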

There's a lot of additional functionality in w3lib for fixing up and converting HTML/markup, and its source code is publicly available.

Since remove_tags is just a function, it's easy to incorporate into your item loader, and it's more lightweight than using BeautifulSoup 4 (BS4).
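
As a rough sketch of what that could look like: strip the tags first, then the surrounding whitespace. The helper clean_cell is a hypothetical name used only for illustration, and the regex fallback is an assumption so the snippet still runs where w3lib isn't installed (it is not w3lib's actual implementation):

```python
import re

try:
    from w3lib.html import remove_tags  # the function recommended above
except ImportError:
    # Assumption: a rough stand-in so the sketch runs without w3lib.
    # It is NOT w3lib's implementation and handles only simple markup.
    def remove_tags(text):
        return re.sub(r"<[^>]+>", "", text)


def clean_cell(html):
    """Remove tags, then trim surrounding whitespace (hypothetical helper)."""
    return remove_tags(html).strip()


# In the spider from the question, the same chain could be used as the
# default input processor (old Scrapy / Python 2 names, as in the question):
#     loader.default_input_processor = MapCompose(remove_tags, unicode.strip)

raw_cells = ['<td> <span class="caps">TEXT</span> </td>', '<td> Text </td>']
cleaned = [clean_cell(cell) for cell in raw_cells]
print(cleaned)  # ['TEXT', 'Text']
```

The order matters: if the strip ran before remove_tags, the whitespace inside the td element would still be there once the tags are gone.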
