n6g7 - 1 year ago
Python Question

Scraping text without javascript code using scrapy

I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites.

The problem is that sometimes my target node contains a <script> tag, so the scraped text contains JavaScript code.

Here is a link to a real example of what I'm working with. In this case, the problem is that there's a <script> tag in the first child div of my target node.

I've spent a lot of time searching for a solution on the web and on SO, but I couldn't find anything. I hope I haven't missed something obvious!


HTML response (only the target node):

<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I don't want';</script>
    <div id="part2">Some other text</div>
</div>

What I want in my item:

Some text
Some other text

What I get:

Some text
var s = 'javascript I don't want';
Some other text

My code

Given an XPath selector, I'm using the following function to extract the text:

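def getText(hxs):
    # hxs is the list of selectors matched by the XPath query
    if len(hxs) > 0:
        # string(.) concatenates every descendant text node, <script> content included
        l = hxs.select('string(.)')
        if len(l) > 0:
            s = l[0].extract().encode('utf-8')
        else:
            # fall back to the raw markup of the first match
            s = hxs[0].extract().encode('utf-8')
        return s
    # nothing matched
    return 0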

I've tried using XPath axes, but to no avail.

kev
Answer Source

Try the utility functions from w3lib.html:

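from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of strings, so take the first match
html = hxs.select('//div[@id="content"]').extract()[0]
# drop <script> elements together with their content, then strip the remaining tags
output = remove_tags(remove_tags_with_content(html, ('script', )))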
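
If adding w3lib is not an option, the same idea (drop script elements together with their content, then keep only the text) can be sketched with the standard library's html.parser. This is only a minimal illustration and not how w3lib works internally. The names TextExtractor and extract_text are made up for the example, and the sample markup is the one from the question with the closing div added.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside the listed tags.

    Hypothetical helper for illustration; not part of Scrapy or w3lib.
    """

    def __init__(self, skip_tags=("script", "style")):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside a skipped element
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of `html` without script/style content, one node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = """<div id="content">
<div id="part1">Some text</div>
<script>var s = 'javascript I don't want';</script>
<div id="part2">Some other text</div>
</div>"""

print(extract_text(sample))
```

On the sample from the question this prints the two text lines and leaves out the script body. In a spider you would pass it the string returned by extract() for the target node.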