I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t
The problem I'm having is that the original page contains accented and phonetic characters (squiggly lines, reversed letters, and so on), so when I read the local files I end up with escape sequences like \x85, \xa7, \x8d, etc.
My question is: is there any way I can convert all those escape characters into their respective UTF-8 characters? For example, if there is an 'à', how do I convert that into a standard 'a'?
Python calling code:
import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)
How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?
Assume you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is this simple:
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
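The session above is Python 2. On Python 3 the same idea might look like the sketch below. It is not part of the original answer: the helper name strip_accents is hypothetical, and the sample bytes are assumed to be UTF-8, so check the real charset of the downloaded page before decoding. Note that encode('ascii', 'ignore') also silently drops any character that has no ASCII base letter, which will include most phonetic symbols.

```python
import unicodedata


def strip_accents(text):
    """Reduce accented letters in text to their ASCII base letters.

    Hypothetical helper: characters with no ASCII decomposition are dropped.
    """
    # NFD splits 'à' into 'a' followed by a combining grave accent (U+0300).
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' discards the combining marks;
    # decoding again gives back a str instead of bytes.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Bytes as they might be read from the saved file (assumed UTF-8 here).
raw = b'caf\xc3\xa9 \xc3\xa0 la carte'
text = raw.decode('utf-8')
print(strip_accents(text))  # prints: cafe a la carte
```

The decode step matters: normalization works on text, not on raw bytes, so the file contents have to be decoded with the correct encoding first.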