Onyi Lam - 2 months ago
Python Question

passing selenium response url to scrapy

I am learning Python and am trying to scrape this page for a specific value from the dropdown menu. After that, I need to click each item in the resulting table to retrieve the specific information. I am able to select the item and retrieve the information with the webdriver, but I do not know how to pass the response URL to the crawlspider.

import time

from scrapy.http import TextResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp')
more_btn = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.ID, '_button_select'))
)
more_btn.click()

## select specific values from the dropdowns
driver.find_element_by_css_selector("select#tabJcwyxt_jiebie > option[value='teyaoxgrs']").click()
driver.find_element_by_css_selector("select#tabJcwyxt_jieci > option[value='d11jie']").click()
search2 = driver.find_element_by_class_name('input_a2')
search2.click()
time.sleep(5)

## dump the rendered page source to a string
## (Python 2 idiom; in Python 3 you would decode instead)
text_html = driver.page_source.encode('utf-8')
html_str = str(text_html)

## this is a hack that initiates a "TextResponse" object (taken from the Scrapy module)
resp_for_scrapy = TextResponse('none', 200, {}, html_str, [], None)

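As an aside, that same hack is easier to read with keyword arguments. This is just a sketch of the equivalent call, assuming Scrapy's TextResponse signature (url, status, headers, body, flags, request, plus an encoding keyword), with the driver's real URL in place of the 'none' placeholder:

## equivalent construction with keyword arguments (sketch);
## driver.current_url replaces the 'none' placeholder
resp_for_scrapy = TextResponse(url=driver.current_url,
                               status=200,
                               body=driver.page_source,
                               encoding='utf-8')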


So this is where I am stuck. I was able to query using the above code, but how can I pass resp_for_scrapy to the crawlspider? I tried putting resp_for_scrapy in place of item, but that didn't work.

## spider
class ProfileSpider(CrawlSpider):
    name = 'pccprofile2'
    allowed_domains = ['cppcc.gov.cn']
    start_urls = ['http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp']

    def parse(self, resp_for_scrapy):
        hxs = HtmlXPathSelector(resp_for_scrapy)
        items = []
        for post in resp_for_scrapy.xpath('//div[@class="table"]//ul//li'):
            item = Ppcprofile2Item()
            item["name"] = hxs.select("//h1/text()").extract()
            item["title"] = hxs.select("//div[@id='contentbody']//tr//td//text()").extract()
            items.append(item)

        ## click next page
        while True:
            try:
                next_page = self.driver.find_element_by_link_text("下一页")
                next_page.click()
            except:
                break

        return items
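
For debugging, the callback can at least be sanity-checked outside the crawl engine (note that self.driver is never set on this spider, so the pagination loop has to come out first). A minimal sketch, assuming the spider class and resp_for_scrapy from above are in scope:

## debugging sketch only: call the callback by hand with the
## hand-built response; this bypasses Scrapy's crawl engine entirely
## (assumes the self.driver pagination loop has been removed)
spider = ProfileSpider()
for item in spider.parse(resp_for_scrapy):
    print(item)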


Any suggestions would be greatly appreciated!!!!

EDIT: I included a middleware class to select from the dropdown before the spider class. But now there is no error and no result.

import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class JSMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get('http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp')

        # select from the dropdown
        more_btn = WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.ID, '_button_select'))
        )
        more_btn.click()

        driver.find_element_by_css_selector("select#tabJcwyxt_jiebie > option[value='teyaoxgrs']").click()
        driver.find_element_by_css_selector("select#tabJcwyxt_jieci > option[value='d11jie']").click()
        search2 = driver.find_element_by_class_name('input_a2')
        search2.click()
        time.sleep(5)

        # get the response
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)



class ProfileSpider(CrawlSpider):
    name = 'pccprofile2'
    rules = [Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=("//div[@class='table']")), callback='parse_item')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Ppcprofile2Item()
        item["name"] = hxs.select("//h1/text()").extract()
        item["title"] = hxs.select("//div[@id='contentbody']//tr//td//text()").extract()
        items.append(item)

        # click next page
        while True:
            next = response.findElement(By.linkText("下一页"))
            try:
                next.click()
            except:
                break

        return items

Answer

Use a Downloader Middleware to catch pages that require Selenium before you process them regularly with Scrapy:

The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.

Here's a very basic example using PhantomJS:

from scrapy.http import HtmlResponse
from selenium import webdriver

class JSMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get(request.url)

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
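
For the middleware to run at all, it also has to be enabled in the project settings. A minimal sketch, assuming the class lives in a module called myproject.middlewares (adjust the path and priority to your project):

## settings.py -- register the custom downloader middleware
## (the module path 'myproject.middlewares' is an assumption)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.JSMiddleware': 543,
}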

Once you return that HtmlResponse (or a TextResponse, if that's what you really want), Scrapy will stop processing downloader middlewares and drop into the spider's parse method:

If it returns a Response object, Scrapy won’t bother calling any other process_request() or process_exception() methods, or the appropriate download function; it’ll return that response. The process_response() methods of installed middleware are always called on every response.

In this case, you can continue to use your spider's parse method as you normally would with HTML, except that the JS on the page has already been executed.

Tip: Since the Downloader Middleware's process_request method receives the spider as an argument, you can set a flag on the spider and check it in the middleware to decide whether a page needs JS processing at all; that lets you handle both JS and non-JS pages with the exact same spider class, as sketched below.
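
A minimal sketch of that idea, assuming a custom js_required flag on the spider (the attribute name is made up, not a Scrapy built-in):

from scrapy.http import HtmlResponse
from selenium import webdriver

class JSMiddleware(object):
    def process_request(self, request, spider):
        ## 'js_required' is a hypothetical spider attribute; spiders that
        ## need JS rendering set js_required = True on the class
        if not getattr(spider, 'js_required', False):
            return None  ## fall through to Scrapy's normal downloader

        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            body = driver.page_source
            url = driver.current_url
        finally:
            driver.quit()  ## don't leak a browser process per request
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)

Returning None from process_request tells Scrapy to keep going with its regular download path, so only spiders that opt in pay the cost of spinning up a browser.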