Morteza Ezzabady - 8 days ago
Python Question

lxml.html5parser: not working for Arabic/Persian HTML5 documents

I'm using lxml's html5parser.
It works fine with ASCII characters, but if I download an HTML file that contains Persian or Russian characters, this error appears:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 418: ordinal not in range(128)
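
For what it's worth, here is a minimal repro of the same failure mode, assuming Python 2 (the implicit ascii decode in the message suggests it): calling .encode() on a byte string first decodes it with the ascii codec, which fails on the first non-ASCII byte.

# Python 2: .encode() on a byte string implicitly decodes it as ascii first
utf8_bytes = u'\u0633\u0644\u0627\u0645'.encode('utf-8')  # Persian "salaam" as UTF-8 bytes
utf8_bytes.encode('utf-8')  # raises UnicodeDecodeError: 'ascii' codec can't decode ...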


This is the response text: http://paste.ubuntu.com/23552349/

And this is my code (as you can see, I have only removed all non-valid XML characters):

import requests
from lxml.html import html5parser

# (headers, cookies and data are initialized earlier; this snippet runs inside a function)
f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data)
resp = f.text
if resp == "":
    return []
resp = resp.encode("utf-8")
resp = ''.join(c for c in resp if valid_xml_char_ordinal(c))
doc = html5parser.fragment_fromstring(resp.encode("utf-8"), guess_charset=False, create_parent='div')
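
valid_xml_char_ordinal isn't shown above; it's the usual XML 1.0 range check, roughly:

def valid_xml_char_ordinal(c):
    # keep only code points that are legal in XML 1.0
    codepoint = ord(c)
    return (0x20 <= codepoint <= 0xD7FF or
            codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or
            0x10000 <= codepoint <= 0x10FFFF)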



  • If I remove the line resp = resp.encode("utf-8"), this error appears:

    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters


Answer

I also ran into some strange inconsistencies when using html5parser directly (TypeError: __init__() got an unexpected keyword argument 'useChardet', and the like).

If you already have lxml installed, working with the BeautifulSoup wrapper is a joy.

First install BeautifulSoup (pip install beautifulsoup4). Then:

import requests
from bs4 import BeautifulSoup

# (initialize headers, cookies and data)

f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data)
resp = f.text
if not resp:
    return []
doc = BeautifulSoup(resp, 'lxml')

Then you can use BeautifulSoup's clean API to manipulate the HTML tree; under the hood it still uses lxml for parsing.
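
For example, extracting links and text works the same regardless of script (the <a> tag here is just an illustration; use whatever getHtml.php actually returns):

from bs4 import BeautifulSoup

html = u'<div><a href="/greet">\u0633\u0644\u0627\u0645</a></div>'  # contains Persian text
doc = BeautifulSoup(html, 'lxml')
links = [(a.get('href'), a.get_text()) for a in doc.find_all('a')]
# [('/greet', u'\u0633\u0644\u0627\u0645')] -- the Persian text survives intact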

Ref for BeautifulSoup API: https://www.crummy.com/software/BeautifulSoup/bs4/doc/