I am trying to write a Python script that behaves like Ctrl + S in the Chrome web browser: it saves the HTML page, downloads any links on the webpage and, finally, replaces the URIs of the links with the local paths on disk.
The code posted below attempts to replace the URIs for CSS files with local paths on my computer.
I have come across an issue when attempting to parse different sites, and it's becoming a bit of a headache.
The original error I get is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 13801: ordinal not in range(128)
url = 'http://www.s1jobs.com/job/it-telecommunications/support/edinburgh/620050561.html'
response = urllib2.urlopen(url)
webContent = response.read()
dest_dir = 'C:/Users/Stuart/Desktop/' + title
for f in glob.glob(r'./*.css'):
newContent = webContent.replace(cssUri, "./" + title + '/' + cssFilename)
newContent = webContent.decode('utf-8').replace(cssUri, "./" + title + '/' + cssFilename)
newContent = webContent.decode(utf-8).replace(cssUri, "./" + title + '/' + cssFilename)
13801: invalid start byte
byte 0x0a in position 44442: truncated data
can't decode bytes in position 0-3: code point not in range(0x110000)
webContent[13801:13850] has some weird characters. Just ignore them.
This is kind of a shot in the dark, but try this:
At the top of your file,
from __future__ import unicode_literals
from builtins import str
It appears that you're attempting to decode a Python object that is probably a Python 2.7 str object, which, in principle, should be some decoded text object.
In the default Python 2.7 kernel:
In : type("é")  # By default, quotes in py2 create py2 strings, i.e. sequences of bytes that, given some encoding, can be decoded to characters in that encoding.
Out: str
In : type("é".decode("utf-8"))  # We can get to the actual text data by decoding it if we know what encoding it was initially encoded in; utf-8 is a safe guess in almost every country but Myanmar.
Out: unicode
In : len("é")  # Note that the py2 `str` representation has a length of 2, because é takes two bytes in UTF-8.
Out: 2
In : len("é".decode("utf-8"))  # The py2 `unicode` representation has length 1, since an accented e is a single character.
Out: 1
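For comparison, here is a rough Python 3 version of the session above, where the two types are called bytes and str. It also tries the byte from the traceback in the question: 0xa3 is neither ASCII nor a valid UTF-8 start byte, but it is the pound sign in Latin-1. That the page really is Latin-1 encoded is only a guess, and try_decode is a helper name made up for this sketch.

```python
# Python 3 sketch: bytes vs. text. The Latin-1 guess below is an assumption.
text = "é"                   # str: one character of text
raw = text.encode("utf-8")   # bytes: the UTF-8 encoding of that character

print(len(text))             # number of characters
print(len(raw))              # number of bytes

problem_byte = b"\xa3"       # the byte reported in the question's traceback


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(problem_byte, "ascii")     # None: 0xa3 is outside range(128)
as_utf8 = try_decode(problem_byte, "utf-8")      # None: invalid start byte
as_latin1 = try_decode(problem_byte, "latin-1")  # the pound sign, U+00A3
print(as_ascii, as_utf8, as_latin1 == "\u00a3")
```

If the last check holds for your page, decoding with 'latin-1' (or whatever charset the response headers declare) instead of 'utf-8' may be what the page needs.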
Some other things of note in Python 2.7:
"é" is the same thing as
u"é" is the same thing as
u"é".encode('utf-8') is the same thing as
str, and encode with py2
Python 3 str, which is the same as Python 2 unicode, can no longer be decoded since a string by definition is a decoded sequence of bytes. By default, it uses the utf-8 encoding.
type("a".decode('ascii')) gives a unicode object, but this behaves nearly identically to str("a"). This is not the case in Python 3.
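To see the Python 3 side of that last point, here is a small sketch (run under Python 3, not 2.7). It only checks which methods exist on each type and what happens when the two are mixed; the variable names are illustrative.

```python
# Python 3 sketch: text cannot be decoded, bytes cannot be encoded.
text = "a"
raw = b"a"

print(hasattr(text, "decode"))  # False: str is already decoded text
print(hasattr(raw, "decode"))   # True: bytes can be decoded to text
print(hasattr(text, "encode"))  # True: text can be encoded to bytes
print(hasattr(raw, "encode"))   # False

# Mixing the two types raises TypeError instead of silently
# attempting an ASCII decode as Python 2 does.
try:
    raw.replace("a", "b")
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```

In Python 2 the same mix is allowed as long as every byte is ASCII, which is why the error in the question only shows up on pages that contain a non-ASCII byte.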
With that said, here's what the snippets above do:
__future__ is a module maintained by the core Python team that backports Python 3 functionality to Python 2, allowing you to use Python 3 idioms within Python 2.
from __future__ import unicode_literals has the following effect:
"é" is the same thing as
"é" is functionally the same thing as
builtins is a module that is approved by the core Python team, and contains safe aliases for using Python 3 idioms in Python 2 with the Python 3 API.
To get the builtins module, you run:
pip install future
from builtins import str has the following effect:
The str constructor now gives what you think it does, i.e. text data in the form of Python 2 unicode objects. So it's functionally the same thing as
str = unicode
str is functionally the same as Python2
The takeaway is this:
Use str objects for bytes and unicode objects for text.
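Applied to the URI replacement from the question, that takeaway might look roughly like the sketch below: decode the downloaded bytes once, do the replacement on text, and encode again only when writing to disk. It is written for Python 3; the function name rewrite_uri, the example.com URL, the local path, the encoding parameter and the errors="replace" policy are all assumptions made for illustration, not part of the original code.

```python
# Sketch only: names, the default encoding and the "replace" error policy
# are assumptions, not taken from the original script.
def rewrite_uri(raw_html, old_uri, new_path, encoding="utf-8"):
    """Decode the page once, replace the URI on text, re-encode as UTF-8."""
    text = raw_html.decode(encoding, errors="replace")  # bytes -> text
    text = text.replace(old_uri, new_path)              # text-only operation
    return text.encode("utf-8")                         # text -> bytes for writing


# Hypothetical page fragment containing the byte 0xa3 from the traceback.
page = b'<link href="http://example.com/site.css"> price: \xa3 5'
out = rewrite_uri(
    page,
    "http://example.com/site.css",
    "./page/site.css",
    encoding="latin-1",  # assumed; use the charset the server actually declares
)
print(out)
```

Keeping the decode and encode steps at the edges means the replace call never mixes bytes and text, which is the situation that triggers the implicit ASCII decode in Python 2.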