Richard Richard - 1 year ago 102
Python Question

How to remove HTML tags in BeautifulSoup when I have contents

I have the contents of an <h2> tag below:

import urllib
from bs4 import BeautifulSoup

initialPage = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
deviceInfo = initialPage.find('div', {'id': 'unitType'}).h2.contents
print('Device Info: ', deviceInfo)
for i in deviceInfo:
    print i

Which outputs:

('Device Info: ', [u'BB100 ', <br>v1.4.3</br>])

How do I remove the <br> HTML tags using BeautifulSoup rather than regex? I've tried two approaches, but neither has worked. Each attempt throws:
'NoneType' object is not callable

Answer Source

You can check whether the element is a <br> tag with if i.name == 'br', and then use the tag's contents in place of the tag itself.

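for i in deviceInfo:
    # Tag elements expose their tag name as .name
    if i.name == 'br':
        # Use the list of children instead of the tag itself
        i = i.contents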

That loop only rebinds i and leaves deviceInfo unchanged. If you need to iterate over the list more than once, modify the list in place:

for n, i in enumerate(deviceInfo):
    if i.name == 'br':
        # Replace the <br> tag in the list with its list of children
        i = i.contents
        deviceInfo[n] = i
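
As a complement to the loops above, here is a minimal sketch (not part of the original answer) that reduces every element to plain text with get_text(), so the result is a flat list of strings rather than a mix of strings and nested lists. The html string and the flatten_contents helper are hypothetical stand-ins for the page in the question, and the sketch is written for Python 3. Depending on the BeautifulSoup version and parser, <br> may be parsed as an empty element with the version text as a sibling string. The sketch skips empty results, so it should cope with either shape.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup standing in for the page fetched in the question.
html = '<div id="unitType"><h2>BB100 <br>v1.4.3</br></h2></div>'
soup = BeautifulSoup(html, 'html.parser')
h2 = soup.find('div', {'id': 'unitType'}).h2


def flatten_contents(nodes):
    """Return the text of each node, dropping nodes that carry no text."""
    texts = []
    for node in nodes:
        # Tags are reduced to their inner text; strings are kept as they are.
        text = node.get_text() if isinstance(node, Tag) else str(node)
        if text:
            texts.append(text)
    return texts


device_info = flatten_contents(h2.contents)
print(device_info)

# If a single string is enough, get_text() on the parent tag skips the loop.
joined = h2.get_text(' ', strip=True)
print(joined)
```

Checking isinstance(node, Tag) instead of comparing node.name keeps the helper from depending on how plain strings respond to attribute access.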