Richard - 1 month ago
Python Question

How to remove HTML tags in BeautifulSoup when I have contents

I have the contents of an h2 tag below:

import urllib
from bs4 import BeautifulSoup

initialPage = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
deviceInfo = initialPage.find('div', {'id': 'unitType'}).h2.contents
print('Device Info: ', deviceInfo)
for i in deviceInfo:
    print i


Which outputs:

('Device Info: ', [u'BB100 ', <br>v1.4.3</br>])
BB100
<br>v1.4.3</br>


How do I remove the <br> and </br> HTML tags, using BeautifulSoup rather than regex? I've tried i.decompose() and i.strip() but neither has worked. It would throw 'NoneType' object is not callable.

Answer

You can check whether an element is a <br> tag with if i.name == 'br', and then use the tag's contents in place of the tag itself.

for i in deviceInfo:
    if i.name == 'br':
        # Rebinds i for this iteration only; deviceInfo is left unchanged.
        i = i.contents
    # use i here

If you need to iterate over the list more than once, replace the element in the list itself.

for n, i in enumerate(deviceInfo):
    if i.name == 'br':
        deviceInfo[n] = i.contents
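
If all you want is the text, a possible alternative is to let BeautifulSoup pull the strings out for you instead of swapping list entries by hand. The sketch below is written for Python 3 with bs4 and makes a few assumptions: SAMPLE_HTML is a made-up stand-in for the page in the question, and node_texts is a hypothetical helper name, not part of BeautifulSoup. Depending on the parser and bs4 version, the version string may end up nested inside the <br> tag or as a separate string after it; the helper is meant to cope with both.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the page in the question.
SAMPLE_HTML = '<div id="unitType"><h2>BB100 <br>v1.4.3</br></h2></div>'


def node_texts(nodes):
    """Return the stripped text of each node, skipping nodes with no text."""
    texts = []
    for node in nodes:
        # A Tag contributes the text nested inside it; a string is used as-is.
        text = node.get_text() if isinstance(node, Tag) else str(node)
        text = text.strip()
        if text:
            texts.append(text)
    return texts


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
h2 = soup.find('div', {'id': 'unitType'}).h2

device_parts = node_texts(h2.contents)  # one entry per piece of text
device_text = h2.get_text()             # all text in the h2, tags dropped

for part in device_parts:
    print(part)
```

For the sample markup this should print BB100 and v1.4.3 on separate lines, and device_text holds the whole heading as a single string. If you don't need your own helper, list(h2.stripped_strings) is a built-in way to get a similar list.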