bluescreenofdeath2016 - 1 month ago
Python Question

How do I encode specific characters to HTML in Python?

I'm scraping Wikipedia using BeautifulSoup4 in Python.

import urllib2
from bs4 import BeautifulSoup
data = urllib2.urlopen(wikiurl)
soup = BeautifulSoup(data, 'html.parser')


I then use

completehtml = ''
for link in soup.find_all('p'):
    completehtml = completehtml + str(link)


to get the HTML for a few paragraphs. (The actual for loop also keeps a counter of the paragraphs seen and breaks once it reaches the limit.)

Once this data has been scraped, I need to enter it on a website, using the scraped HTML itself. The problem is that some characters, such as the en dash, are not encoded as HTML entities, which causes stray symbols to appear instead.

They print out fine in Python. But when I use pyautogui or Selenium's ActionChains class to send keys and enter the scraped string that way, they are entered as symbols.

How do I fix this? I'm looking for a solution in Python.

EDIT:
Okay, so the main issue arises when there are non-ASCII characters in the scraped HTML.
They end up decoded as 'latin-1' when they're copied to the clipboard or entered using the send-keys method from Python.

Answer

I believe the answers to this post would give you what you need: Convert HTML entities to Unicode and vice versa
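
As a rough sketch of one approach along those lines (not necessarily the one the linked post settles on), Python's built-in `xmlcharrefreplace` error handler can turn every non-ASCII character into a numeric character reference while leaving the tags untouched. The helper name `to_ascii_html` and the sample string are made up for illustration, and the input is assumed to be a Unicode string (for example `unicode(link)` rather than `str(link)` on Python 2).

```python
def to_ascii_html(markup):
    """Return markup with each non-ASCII character replaced by a
    numeric character reference such as &#8211;.

    `markup` is assumed to be a Unicode string. ASCII characters,
    including the < and > of the tags, pass through unchanged.
    """
    # encode() substitutes &#NNNN; for anything outside ASCII;
    # decode() turns the resulting bytes back into text.
    return markup.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# Hypothetical sample: a paragraph containing an en dash (U+2013).
sample = u'<p>1914\u20131918</p>'
print(to_ascii_html(sample))  # <p>1914&#8211;1918</p>
```

Because the result is pure ASCII, it should survive the clipboard or send-keys step regardless of which encoding that step assumes. A browser rendering the pasted HTML will display `&#8211;` as an en dash.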