I'd like to implement some unit tests in a Scrapy project (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I should be able to run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.
I've been talking on the Scrapy-Users list, and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work, though.
I can build a unit test class and, in a test, create a response object and try to call the spider's parse method with it.
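For example, I imagine it looking something like this (a rough sketch only; the spider name, module path, and the canned HTML body are placeholders, not working code from my project):

```python
import unittest
from scrapy.http import HtmlResponse, Request
from myproject.spiders.my_spider import MySpider  # placeholder spider


class MySpiderTest(unittest.TestCase):
    def test_parse(self):
        url = 'http://www.example.com'
        # Build a fake response from a canned HTML body instead of hitting the network.
        response = HtmlResponse(
            url=url,
            request=Request(url=url),
            body=b'<html><body><h1>Hello</h1></body></html>',
            encoding='utf-8',
        )
        # Call the parse method directly and assert on what it yields.
        results = list(MySpider().parse(response))
        self.assertTrue(results)
```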
The way I've done it is to create fake responses; that way you can test the parse function offline while still exercising real HTML.
A problem with this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online, you may have a big bug while your test cases still pass. For that reason it may not be the best way to test.
My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that is causing the error, and write a unit test for it.
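One way to capture that HTML automatically is to dump the body of the failing response to disk so it can be dropped into the test fixtures. A rough sketch of such a helper (the function name and output directory are made up, not part of my project):

```python
import os


def save_failed_response(response, out_dir='tests/responses/failed'):
    """Dump the body of a response that broke parsing, so it can become a test fixture."""
    os.makedirs(out_dir, exist_ok=True)
    # Derive a crude file name from the URL; good enough for occasional failures.
    name = response.url.rstrip('/').rsplit('/', 1)[-1] or 'index'
    path = os.path.join(out_dir, name + '.html')
    with open(path, 'wb') as f:
        f.write(response.body)
    return path
```

The saved file can then be moved under the responses directory described below and given its own test case.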
This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:
```python
# scrapyproject/tests/responses/__init__.py
import os

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_name, url=None):
    """
    Create a fake Scrapy HTTP response from an HTML file.

    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'
    request = Request(url=url)

    if not file_name.startswith('/'):
        # Relative path: resolve it against this responses directory.
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name
    with open(file_path, 'rb') as f:
        file_content = f.read()

    # HtmlResponse (rather than the base Response) so that selectors like
    # response.xpath()/response.css() work in the spider under test.
    return HtmlResponse(url=url, request=request, body=file_content,
                        encoding='utf-8')
```
The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.
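A quick sanity check that the helper behaves as expected (run from the scrapyproject/tests directory; the assertion about the page contents is only an assumption about sample.html):

```python
from responses import fake_response_from_file

resp = fake_response_from_file('osdir/sample.html')
assert resp.url == 'http://www.example.com'  # default URL from the helper
assert resp.css('body')                      # selectors work because it is an HtmlResponse
```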
Then the test case could look as follows (it lives at scrapyproject/tests/test_osdir.py):
```python
import unittest

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
            count += 1
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)
```
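If the parse method also yields follow-up Requests (as the question mentions), those can be checked in the same pass. A sketch of an extra test method for the class above, assuming parse yields Requests at all (the URL assertion is just an example check; it needs `from scrapy.http import Request` at the top of the file):

```python
def test_parse_requests(self):
    results = list(self.spider.parse(fake_response_from_file('osdir/sample.html')))
    # Separate the yielded Requests from the items and check their URLs.
    requests = [r for r in results if isinstance(r, Request)]
    for request in requests:
        self.assertTrue(request.url.startswith('http'))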
That's basically how I test my parsing methods, though the approach isn't limited to parsing. If it gets more complex, I suggest looking at Mox.
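As an alternative to Mox, the standard library's unittest.mock can cover much of the same ground. For example, if a pipeline talks to an external service, the client can be stubbed out. A sketch with made-up names (MyPipeline, its db attribute, and the assumption that process_item calls db.save and returns the item are all hypothetical):

```python
import unittest
from unittest import mock

from scrapyproject.pipelines import MyPipeline  # hypothetical pipeline


class MyPipelineTest(unittest.TestCase):

    def test_process_item_stores_item(self):
        pipeline = MyPipeline()
        pipeline.db = mock.Mock()  # stub out the external dependency

        item = {'title': 'A title', 'content': 'Some content'}
        result = pipeline.process_item(item, spider=None)

        self.assertEqual(result, item)
        pipeline.db.save.assert_called_once_with(item)
```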