user3019191 - 6 months ago
HTML Question

Python Unicode and ASCII issues when parsing HTML

I am trying to write a Python script that acts like Ctrl + S in the Chrome web browser: it saves the HTML page, downloads any links on the webpage and, finally, replaces the URIs of the links with the local path on disk.

The code posted below attempts to replace the URIs of CSS files with local paths on my computer.

I have come across an issue when attempting to parse different sites, and it's becoming a bit of a headache.

The original error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 13801: ordinal not in range(128)


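import glob
import shutil
import urllib2

url = 'http://www.s1jobs.com/job/it-telecommunications/support/edinburgh/620050561.html'

response = urllib2.urlopen(url)
webContent = response.read()  # raw bytes (a Python 2 str), not decoded text
# title, cssUri and cssFilename are set elsewhere in the script (not shown)
dest_dir = 'C:/Users/Stuart/Desktop/' + title
for f in glob.glob(r'./*.css'):
    newContent = webContent.replace(cssUri, "./" + title + '/' + cssFilename)
    shutil.move(f, dest_dir)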


This issue persists whether I attempt to print newContent or write it to a file. I tried to follow the top answer to the Stack Overflow question "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)" and changed my line to

newContent = webContent.decode('utf-8').replace(cssUri, "./" + title + '/' + cssFilename)

I have also attempted .decode('utf-16') and .decode('utf-32'). The three attempts give these errors respectively:

13801: invalid start byte
byte 0x0a in position 44442: truncated data
can't decode bytes in position 0-3: code point not in range(0x110000)


Does anyone have any idea how I should remedy this issue? I should add that when I print the variable webContent, there is output (I noticed Chinese writing at the bottom though).

Answer

THIS WILL SOLVE YOUR ISSUE

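Use webContent.decode('utf-8', errors='ignore') or webContent.decode('latin-1').

webContent[13801:13850] contains bytes that are not valid UTF-8. The byte 0xa3 is the pound sign in Latin-1, so the page is probably not UTF-8 encoded. You can either ignore those bytes or decode the page as Latin-1.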
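
As a rough illustration of the two options above, here is a small sketch using a made-up byte string (not taken from the actual page) that contains the byte 0xa3. It assumes the page is Latin-1 encoded, which is only a guess based on that byte.

```python
# -*- coding: utf-8 -*-
# Hypothetical fragment standing in for webContent; 0xa3 is the pound sign in Latin-1.
raw = b"Salary: \xa340,000"

# Strict UTF-8 decoding fails, because 0xa3 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# Option 1: silently drop the bytes that are not valid UTF-8 (the pound sign is lost).
ignored = raw.decode("utf-8", errors="ignore")

# Option 2: decode as Latin-1, which maps every byte to a character (the pound sign survives).
latin = raw.decode("latin-1")
```

Note that errors='ignore' throws data away, so decoding with the page's real encoding is usually preferable when you know it.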

IGNORE EVERYTHING BELOW HERE


This is kind of a shot in the dark, but try this:

At the top of your file, add:

from __future__ import unicode_literals
from builtins import str

It appears that the object you're working with is a Python 2.7 str, which is a sequence of encoded bytes rather than decoded text, so it has to be decoded with the right codec before you treat it as text.

Brief Explanation

In the default python 2.7 kernel :

(iPython session)

In [1]: type("é") # In py2, plain quotes create a py2 `str`: a sequence of bytes that can be decoded to text if you know the encoding.
Out[1]: str

In [2]: type("é".decode("utf-8")) # We can get to the actual text data by decoding it, if we know what encoding it was initially encoded in; utf-8 is a safe guess in almost every country but Myanmar.
Out[2]: unicode

In [3]: len("é") # Note that the py2 `str` representation has a length of 2: in UTF-8, é is encoded as the two bytes 0xC3 0xA9.
Out[3]: 2

In [4]: len("é".decode("utf-8")) # The py2 `unicode` representation has length 1, since an accented e is a single character.
Out[4]: 1

Some other things of note in python 2.7:

  • "é" is the same thing as str("é")
  • u"é" is the same thing as "é".decode('utf-8') or unicode("é", 'utf-8') (assuming the source is UTF-8 encoded)
  • u"é".encode('utf-8') is the same thing as str("é")
  • You typically call decode on a py2 str, and encode on a py2 unicode.
    • Due to early design issues, you can call both on either, even though that doesn't really make any sense.
    • In python3, str, which is the same as python2 unicode, can no longer be decoded, since a string is by definition already decoded text. Its encode method uses utf-8 by default.
  • Byte sequences that were encoded with the ascii codec behave almost exactly the same as their decoded counterparts.
    • In python 2.7 with no future imports: type("a".decode('ascii')) gives a unicode object, but this behaves nearly identically to str("a"). This is not the case in python3.
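
The bytes-versus-text distinction in the notes above can be sketched without depending on the terminal encoding by writing the UTF-8 bytes of é explicitly. This is only an illustration. It uses b"..." and u"..." literals so that it behaves the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
raw = b"\xc3\xa9"             # the two UTF-8 bytes that encode the character é
text = raw.decode("utf-8")    # decoded text: py2 unicode, py3 str

byte_len = len(raw)           # counts bytes
char_len = len(text)          # counts characters

# Encoding the text again with the same codec gives back the original bytes.
round_trip = text.encode("utf-8") == raw

# Pure-ASCII data is one byte per character, so both lengths agree.
ascii_raw = b"a"
ascii_text = ascii_raw.decode("ascii")
same_length = len(ascii_raw) == len(ascii_text)
```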

With that said, here's what the snippets above do :

  • __future__ is a standard-library module, maintained by the core python team, that backports python3 behaviour to python2 to allow you to use python3 idioms within python2.
  • from __future__ import unicode_literals has the following effect:
    • Without the future import, "é" is the same thing as str("é")
    • With the future import, "é" is the same thing as u"é", i.e. a unicode object
  • builtins is a standard-library module in python3. On python2 it is provided by a third-party package and contains safe aliases for using python3 idioms in python2 with the python3 api.
    • Due to reasons beyond me, the package itself is named "future", so to install the builtins module you run: pip install future
  • from builtins import str has the following effect:
    • The str constructor now gives what you think it does, i.e. text data in the form of python2 unicode objects. So it's functionally the same thing as str = unicode
    • Note: Python3 str is functionally the same as Python2 unicode
    • Note: To get bytes, you can use the b prefix, e.g. b'é' in python2. Python3 only allows ASCII characters in bytes literals, so there you would write b'\xc3\xa9'.

The takeaway is this :

  1. Decode early (as soon as you read the data) and encode late (only when you write it out)
  2. In Python 2, use str objects for bytes and unicode objects for text
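
The takeaway above might look roughly like the following sketch. The function name rewrite_links, the example.com URI, the local path and the sample bytes are all hypothetical, and the Latin-1 default is an assumption about the page rather than something confirmed in the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def rewrite_links(raw_bytes, old_uri, new_path, encoding="latin-1"):
    """Decode the downloaded bytes once, then do all replacing on text."""
    text = raw_bytes.decode(encoding)      # decode early
    return text.replace(old_uri, new_path)


# Hypothetical stand-in for response.read(); 0xa3 is the pound sign in Latin-1.
raw = b'<link href="http://example.com/site.css"> Price: \xa350'
page = rewrite_links(raw, u"http://example.com/site.css", u"./page/site.css")

# Encode late: io.open encodes the text only at the moment it is written.
out_path = os.path.join(tempfile.mkdtemp(), "page.html")
with io.open(out_path, "w", encoding="utf-8") as fh:
    fh.write(page)

# Read the file back as raw bytes to see what was actually stored.
with io.open(out_path, "rb") as fh:
    saved = fh.read()
```

Because the URIs passed to replace are text as well, nothing is mixed with raw bytes, so Python 2 never needs to fall back to an implicit ASCII decode.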