ciferkey ciferkey - 6 months ago 102
Python Question

Scrapy Unit Testing

I'd like to implement some unit tests in a Scrapy (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command I can run it through something like nose. Since scrapy is built on top of twisted can I use its unit testing framework Trial? If so, how? Otherwise I'd like to get nose working.

Update:

I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.

I can build a unit-test test class and in a test:


  • create a response object

  • try to call the parse method of my spider with the response object



However it ends up generating this traceback. Any insight as to why?

Answer

The way I've done it is create fake responses, this way you can test the parse function offline. But you get the real situation by using real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online you may have a big bug, but your test cases will still pass. So it may not be the best way to test this way.

My current workflow is, whenever there is an error I will sent an email to admin, with the url. Then for that specific error I create a html file with the content which is causing the error. Then I create a unittest for it.

This is the code I use to create sample Scrapy http responses for testing from an local html file:

# scrapyproject/tests/responses/__init__.py

import os

from scrapy.http import Response, Request

def fake_response_from_file(file_name, url=None):
    """
    Create a Scrapy fake HTTP response from a HTML file
    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A scrapy HTTP response which can be used for unittesting.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name[0] == '/':
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    file_content = open(file_path, 'r').read()

    response = Response(url=url,
        request=request,
        body=file_content)
    response.encoding = 'utf-8'
    return response

The sample html file is located in scrapyproject/tests/responses/osdir/sample.html

Then the testcase could look as follows: The test case location is scrapyproject/tests/test_osdir.py

import unittest
from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file

class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        permalinks = set()
        for item in results:
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)

That's basically how I test my parsing methods, but its not only for parsing methods. If it gets more complex I suggest looking at Mox