user61629 - 1 month ago 19
Python Question

Scrapy opening HTML in editor, not browser

I am working on some code which returns an HTML string (my_html). I want to see how this looks in a browser using https://doc.scrapy.org/en/latest/topics/debug.html#open-in-browser. I just asked a question on this (Scrapy - How to load html string into open_in_browser function), and the answers showed me how to load the string into a response object that can be passed to open_in_browser:

from scrapy.http import TextResponse
from scrapy.utils.response import open_in_browser

headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36",
    'referer': "http://civilinquiry.jud.ct.gov/GetDocket.aspx",
    'accept-encoding': "gzip, deflate, sdch",
    'accept-language': "en-US,en;q=0.8",
    'cache-control': "no-cache",
}

new_response = TextResponse('https://doc.scrapy.org/en/latest/topics/request-response.html#response-objects', headers=headers, body='<html><body>Oh yeah!</body></html>')
open_in_browser(new_response)


However, the text now opens in Notepad rather than the browser (on my Windows system), which makes me think the system treats it as a plain text string rather than HTML (even though it has an outer html tag).

How can I get this working?

edit:

I changed the code to

new_response = Response('https://doc.scrapy.org/en/latest/topics/request-response.html#response-objects', headers=headers, body='<html><body>Oh yeah!</body></html>')


Now getting:

TypeError: Unsupported response type: Response


edit2:

I realized that in my version of scrapy (1.1) the source is:

def open_in_browser(response, _openfunc=webbrowser.open):
    """Open the given response in a local web browser, populating the <base>
    tag for external links to work
    """
    from scrapy.http import HtmlResponse, TextResponse
    # XXX: this implementation is a bit dirty and could be improved
    body = response.body
    if isinstance(response, HtmlResponse):
        if b'<base' not in body:
            repl = '<head><base href="%s">' % response.url
            body = body.replace(b'<head>', to_bytes(repl))
        ext = '.html'
    elif isinstance(response, TextResponse):
        ext = '.txt'
    else:
        raise TypeError("Unsupported response type: %s" %
                        response.__class__.__name__)
    fd, fname = tempfile.mkstemp(ext)
    os.write(fd, body)
    os.close(fd)
    return _openfunc("file://%s" % fname)


I changed Response to HtmlResponse and it started working. Thank you.

Answer

Look at how the open_in_browser() function is defined:

if isinstance(response, HtmlResponse):
    if b'<base' not in body:
        repl = '<head><base href="%s">' % response.url
        body = body.replace(b'<head>', to_bytes(repl))
    ext = '.html'
elif isinstance(response, TextResponse):
    ext = '.txt'
else:
    raise TypeError("Unsupported response type: %s" %
                    response.__class__.__name__)

It creates a .txt file when it receives a TextResponse, which is why Notepad opens the file with your HTML inside.

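Instead, you need to initialize an HtmlResponse object (from scrapy.http) and pass it to open_in_browser(). A plain Response is not enough: it falls through to the else branch and raises the TypeError.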

Or you can create a temporary HTML file with the desired contents manually and open it in the default browser with webbrowser.open(), using the file:// protocol.
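The manual route could look roughly like the sketch below, which uses only the standard library. The function name open_html_string and its opener parameter are hypothetical names chosen for this example; opener defaults to webbrowser.open and exists so the browser call can be swapped out.

```python
import os
import pathlib
import tempfile
import webbrowser


def open_html_string(html, opener=webbrowser.open):
    """Write html to a temporary .html file and open it via a file:// URL.

    open_html_string and opener are illustrative names, not part of Scrapy.
    Returns the path of the temporary file so the caller can delete it later.
    """
    # The .html suffix is what makes the OS choose a browser over a text editor
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(html)
    # as_uri() builds a well-formed file:// URL, including on Windows paths
    opener(pathlib.Path(path).as_uri())
    return path


if __name__ == "__main__":
    open_html_string("<html><body>Oh yeah!</body></html>")
```

The temporary file is not removed automatically, since the browser may still be reading it when the function returns.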
