Richard Neish - 1 month ago
Python Question

How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4

I am processing HTML using Python and the BeautifulSoup 4 library and I can't find an obvious way to replace &nbsp; with a space. Instead it seems to be converted to a Unicode non-breaking space character.

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit to add that I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.

Answer

See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.
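
To see that conversion in practice, here is a minimal sketch. It assumes the bs4 package is installed and uses Python's built-in "html.parser"; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing a non-breaking space entity.
html = "<p>Hello&nbsp;world</p>"

soup = BeautifulSoup(html, "html.parser")
text = soup.p.get_text()

# The entity itself is gone; in its place is the single character
# U+00A0 (NO-BREAK SPACE), written "\xa0" in a Python string literal.
print(repr(text))
print("\xa0" in text)
```

The character looks like an ordinary space when printed, which is why repr() is used to inspect it.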

Yes, &nbsp; is turned into a non-breaking space character. If you really want those to be space characters instead, you'll have to do a Unicode replace.
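
One way to do that replacement, sketched under the assumption of Python 3 and a reasonably recent bs4 (the sample markup and variable names are illustrative, not from the question). The first option cleans only the extracted text; the second rewrites the text nodes in the parsed tree so that the serialized HTML is cleaned as well.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two non-breaking space entities.
html = "<p>Price:&nbsp;10&nbsp;EUR</p>"
soup = BeautifulSoup(html, "html.parser")

# Option 1: leave the tree alone and clean the extracted text.
plain = soup.get_text().replace("\xa0", " ")

# Option 2: replace the character inside the tree itself.
# find_all() returns a list, so modifying nodes while looping is safe.
for node in soup.find_all(string=True):
    if "\xa0" in node:
        node.replace_with(node.replace("\xa0", " "))

print(plain)
print(soup)
```

On Python 2 the same idea would need unicode literals such as u"\xa0". Which option fits depends on whether only the text or the whole document is needed afterwards.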