Luke Skywalker - 2 months ago
Python Question

urllib read() changing attributes

I have a basic script that requests websites to get their HTML source code.
While crawling several websites, I noticed that various attributes in the source code are represented incorrectly.

Example:

from urllib import request

opener = request.build_opener()
with opener.open("https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2") as response:
    html = response.read()
print(html)


I compared the result (the html variable) with the source code shown by Chrome and Firefox.

I saw differences like these:

Browser               urllib

href='rfc2616.html'   href=\'rfc2616.html\'
rev='Section'         rev=\'Section\'
rel='xref'            rel=\'xref\'
id='sec4.5'           id=\'sec4.4\'


It looks like urllib is inserting backslashes here to escape the quotes.

Is this a bug deep inside urllib, or is there a way to fix this problem?

Thanks in advance.

Jim
Answer

response.read() returns a bytes object. When a bytes object is printed, Python shows its repr, so escape sequences are displayed literally instead of being interpreted:

print(b'hello\nworld') # prints b'hello\nworld'
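
The same effect explains the backslashes in the question. Here is a minimal sketch showing that they exist only in the printed representation and not in the data; the raw value is a made-up stand-in for the response body, not the actual page content.

```python
# Hypothetical stand-in for response.read(); not the real page content.
raw = b"<a href='rfc2616.html' rel=\"xref\">"

# print() on a bytes object shows repr(raw). Because the data contains both
# quote characters, repr() wraps it in single quotes and escapes the inner ones.
shown = repr(raw)
print(shown)

# The data itself contains no backslash at all; only the repr does.
print(b"\\" in raw)    # False
print("\\'" in shown)  # True
```

So nothing in the downloaded HTML was changed; only the way it is displayed differs.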

You'll need to decode it to a str, which, when printed, shows the text itself with no escaping:

print(html.decode())
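
Note that html.decode() with no argument assumes UTF-8. A slightly more defensive sketch takes the charset from the Content-Type header when the server sends one. The helper names decode_body and fetch_text are illustrative and not part of urllib, and falling back to UTF-8 with errors="replace" is an assumption that may not suit every site.

```python
from urllib import request


def decode_body(raw, charset=None):
    """Decode raw response bytes; fall back to UTF-8 (an assumption)."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url):
    """Fetch url and return its body as str (hypothetical helper)."""
    with request.urlopen(url) as response:
        # get_content_charset() returns None if the header names no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Offline check with made-up bytes; call fetch_text(url) for a real page.
print(decode_body(b"href='rfc2616.html' rel=\"xref\""))
```

The printed text contains the quotes exactly as the browser shows them, with no backslashes.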