Luke Skywalker - 11 months ago
Python Question

urllib read() changing attributes

I have a basic script that requests websites to get their HTML source code.
While crawling several websites, I noticed that various attributes in the source code are represented incorrectly.


from urllib import request

opener = request.build_opener()
with opener.open("") as response:
    html = response.read()

I compared the results (the html variable) with the source code shown by Chrome and Firefox.

I saw differences like these:

Browser                 Urllib

href='rfc2616.html'     href=\'rfc2616.html\'
rev='Section'           rev=\'Section\'
rel='xref'              rel=\'xref\'
id='sec4.5'             id=\'sec4.4\'

It looks like response.read() is inserting backslashes here to escape the quotes.

Is this a bug deep inside urllib, or is there a way to fix this problem?

Thanks in advance.

Jim
Answer

response.read() returns a bytes object. When a bytes object is printed, Python shows its repr, so escape sequences are displayed literally instead of being interpreted:

print(b'hello\nworld') # prints b'hello\nworld'

You'll need to decode it to str which, when printed, evaluates the escapes correctly:
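A minimal sketch of that decoding step. The sample bytes below are made up to stand in for what response.read() returns, and UTF-8 is an assumed encoding, not something the page in the question is known to use:

```python
# Hypothetical sample standing in for the bytes returned by response.read()
raw = b"<a href='rfc2616.html' rel=\"xref\">link</a>\n"

# Printing a bytes object shows its repr, so quotes and newlines appear escaped
print(raw)

# Decoding converts the bytes to str; UTF-8 is an assumption here
html = raw.decode("utf-8")

# Printing the str shows the markup as a browser's "view source" would
print(html)
```

With a real response, the charset could be taken from response.headers.get_content_charset() when the server sends one, falling back to UTF-8 otherwise.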