I have a basic script that requests websites to get their HTML source code.
While crawling several websites, I noticed that various attributes in the source code are represented incorrectly.
from urllib import request
opener = request.build_opener()
with opener.open("https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2") as response:
    html = response.read()
response.read() returns a bytes object. When a bytes object is printed, its escape sequences are not interpreted:
print(b'hello\nworld') # prints b'hello\nworld'
You'll need to decode it to str which, when printed, evaluates the escapes correctly:
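As a minimal sketch of that decoding step: the helper names `decode_body` and `fetch_html` are hypothetical and not part of the original script, and the fallback to UTF-8 is an assumption for responses that do not declare a charset.

```python
from typing import Optional
from urllib import request


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a response body to str.

    Falls back to UTF-8 (an assumption) when no charset is given, and
    replaces undecodable bytes instead of raising.
    """
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url: str) -> str:
    """Fetch a page and decode it using the charset from the Content-Type header."""
    opener = request.build_opener()
    with opener.open(url) as response:
        # get_content_charset() returns None if the server sent no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


raw = b'hello\nworld'
print(raw)               # prints b'hello\nworld'
print(decode_body(raw))  # prints hello and world on separate lines
```

Calling `fetch_html(...)` with the URL from the question should then return a str rather than a bytes object, so printing it shows real line breaks instead of `\n`.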