Question

Is it possible for Scrapy to get plain text from raw HTML data directly, instead of using XPath selectors?

For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content


Then I get the following raw HTML:

<div id="content">


<h2>Welcome to Scrapy</h2>

<h3>What is Scrapy?</h3>

<p>Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from their
pages. It can be used for a wide range of purposes, from data mining to
monitoring and automated testing.</p>

<h3>Features</h3>

<dl>

<dt>Simple</dt><dt>
</dt><dd>Scrapy was designed with simplicity in mind, by providing the features
you need without getting in your way</dd>

<dt>Productive</dt>
<dd>Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you</dd>

<dt>Fast</dt>
<dd>Scrapy is used in production crawlers to completely scrape more than
500 retailer sites daily, all in one server</dd>

<dt>Extensible</dt>
<dd>Scrapy was designed with extensibility in mind and so it provides
several mechanisms to plug new code without having to touch the framework
core

</dd><dt>Portable, open-source, 100% Python</dt>
<dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>

<dt>Batteries included</dt>
<dd>Scrapy comes with lots of functionality built in. Check <a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
section</a> of the documentation for a list of them.</dd>

<dt>Well-documented &amp; well-tested</dt>
<dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
with <a href="http://static.scrapy.org/coverage-report/">very good code
coverage</a></dd>

<dt><a href="/community">Healthy community</a></dt>
<dd>
1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
</dd>

<dt><a href="/support">Commercial support</a></dt>
<dd>A few companies provide Scrapy consulting and support</dd>

<p>Still not sure if Scrapy is what you're looking for?. Check out <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
glance</a>.

</p><h3>Companies using Scrapy</h3>

<p>Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of <a href="/companies/">Companies
using Scrapy</a>.</p>

<h3>Where to start?</h3>

<p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
then <a href="/download/">download Scrapy</a> and follow the <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.


</p></dl></div>


But I want to get plain text like the following, directly from Scrapy:





Welcome to Scrapy



What is Scrapy?



Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from
their pages. It can be used for a wide range of purposes, from data
mining to monitoring and automated testing.

Features



Simple
Scrapy was designed with simplicity
in mind, by providing the features you need without getting in your
way


Productive
Just write the rules to extract the data from
web pages and let Scrapy crawl the entire web site for you


Fast
Scrapy is used in production crawlers to completely
scrape more than 500 retailer sites daily, all in one server


Extensible
Scrapy was designed with extensibility in mind
and so it provides several mechanisms to plug new code without having
to touch the framework core

Portable, open-source, 100% Python
Scrapy is
completely written in Python and runs on Linux, Windows, Mac and
BSD


Batteries included
Scrapy comes with lots of
functionality built in. Check this
section of the documentation for a list of them.


Well-documented & well-tested
Scrapy is extensively documented and has an comprehensive test
suite with very
good code coverage


Healthy community
1,500
watchers, 350 forks on Github (link)
700 followers on
Twitter (link)
850
questions on StackOverflow (link)
200
messages per month on mailing list (link)

40-50 users always connected to IRC channel (link)


Commercial support
A few companies
provide Scrapy consulting and support


Still not sure if Scrapy is what you're looking for?. Check out Scrapy at a
glance.

Companies using Scrapy



Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of Companies using Scrapy.

Where to start?



Start by reading Scrapy at a
glance, then download Scrapy and follow
the Tutorial.





I do not want to use XPath selectors to extract the p, h2, h3, etc. tags, because I am crawling a website whose main content is embedded recursively inside table and tbody elements, and finding the right XPath expressions would be tedious. Can this be done with a built-in function in Scrapy, or do I need an external tool for the conversion? I have read through all of Scrapy's docs but found nothing. Here is a sample site that converts raw HTML into plain text: http://beaker.mailchimp.com/html-to-text

Answer

Scrapy doesn't have such functionality built in. html2text is what you are looking for.
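
html2text is a third-party package, so it would typically be installed first, for example with pip:

pip install html2text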

Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text


class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)

prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
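
Applied to the scrapy shell session from the question, the same conversion would look roughly like this (a sketch that assumes html2text is installed and reuses the hxs selector and the //*[@id="content"] XPath from the shell example above):

import html2text

# the #content div extracted in the scrapy shell session
content = hxs.select('//*[@id="content"]').extract()[0]

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop link markup, keep the anchor text
converter.ignore_images = True  # skip image markup as well
print converter.handle(content)  # plain text of the whole div

handle() walks the entire HTML fragment, so there is no need to target the individual p, h2 or h3 tags with XPath.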
