Kumakaja - 12 days ago
HTML Question

BeautifulSoup4 and escaped data in HTML

The HTML I load into BeautifulSoup4 is in this format:

\\u003C/span\\u003E\\u003Ca href=\\"javascript:void(0)\\" class=\\"something something22\\"\\u003EShowMore\\u003C/a\\u003E\\u003C/span\\u003E\\u003Cspan style=\\"display:none\\" class=\\"review-full-text\\"\\u003ESomething else....


Because of this, BeautifulSoup4 can't find the HTML tags. For example, a lookup like this, which normally works, finds nothing:

bsoup1.find_all("div", class_="some_class")


Is there a standard way to fix that?

Answer

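You can try decoding it with the unicode_escape codec, which turns the \uXXXX sequences back into the characters they stand for (here < and >) so that the parser sees real tags: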

data = '\\u003C/span\\u003E\\u003Ca href=\\"javascript:void(0)\\" class=\\"something something22\\"\\u003EShowMore\\u003C/a\\u003E\\u003C/span\\u003E\\u003Cspan style=\\"display:none\\" class=\\"review-full-text\\"\\u003ESomething'

print(data.encode('utf-8').decode('unicode_escape'))
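One caveat with the line above: unicode_escape treats the input bytes as Latin-1, so any real non-ASCII characters already in the string (such as "é" or "ü") can come out garbled. If the escaped text is actually the body of a JSON or JavaScript string literal, which is an assumption about where this data came from, a possible alternative is to let the json module undo the escapes. The helper name unescape_js_string and the sample input below are made up for illustration; this is a rough sketch, not the only way to do it:

```python
import json


def unescape_js_string(raw: str) -> str:
    # Hypothetical helper: wrap the fragment in double quotes so that it
    # forms a JSON string literal, then let json.loads undo the escapes.
    # Assumes every double quote inside `raw` is already backslash-escaped;
    # otherwise json.loads raises json.JSONDecodeError.
    return json.loads('"' + raw + '"')


# Illustrative sample in the same style as the question's data (raw string,
# so the backslashes are kept literally), plus some non-ASCII characters.
raw = r'\u003Cspan class=\"review-full-text\"\u003ECafé ünïcode\u003C/span\u003E'

decoded = unescape_js_string(raw)
print(decoded)
```

The decoded string can then be handed to the parser as usual, for example BeautifulSoup(decoded, "html.parser"), after which calls such as find_all("span", class_="review-full-text") should see the tags again.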

See section 7.2.4, "Python Specific Encodings", in the Python documentation for the codecs module.
