user2496550 - 5 months ago
Python Question

Stripping HTML tags from a string while keeping or removing the text in between

I'd like to clean up some HTML in Python 3. I used span tags to mark inserted text with a color and deleted text with a strikethrough. An example:

<p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. <span class="inserted">
Lorem ipsum</span> Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum.
<span class="strikethrough">Lorem ipsum</span> lorem
<span class="inserted">ipsum</span>. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.</p>


What I'd like to do is remove the span tags, keeping the text inside spans with the class 'inserted' and deleting the text inside spans with the class 'strikethrough'.

I found this code, which strips the tags and keeps the text between them:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


But I'd also like to remove the text between the span tags if the tag has a particular class ('strikethrough').

How can I do that?

Answer

You are almost there. You just need to implement the handle_starttag() and handle_endtag() methods and use a variable to keep track of the current state.
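To see what the parser reports for each piece of markup, here is a minimal tracing sketch. The class name EventTracer and its events list are made up for illustration; only the handle_* callbacks come from html.parser.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Hypothetical helper: records every callback the parser fires."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


tracer = EventTracer()
tracer.feed('a <span class="strikethrough">b</span> c')
tracer.close()  # flush any text still buffered by the parser

for event in tracer.events:
    print(event)
```

The text 'b' is delivered through handle_data() between the start and end callbacks for the span, which is why a flag set in handle_starttag() and cleared in handle_endtag() is enough to decide whether to keep it.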

How about this:

from html.parser import HTMLParser


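class MLStripper(HTMLParser):
    def __init__(self):
        # The base class sets up the parser state; the old 'strict'
        # option no longer exists in current Python 3 versions.
        super().__init__(convert_charrefs=True)

        self._skip_depth = 0  # > 0 while inside a strikethrough span
        self._result = []

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        # Look at the class attribute only; it may list several classes.
        classes = (dict(attrs).get('class') or '').split()
        if self._skip_depth or 'strikethrough' in classes:
            # Count nested spans so only the matching </span> ends the skip.
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Other end tags (e.g. </b>) must not end the skipped region.
        if tag == 'span' and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._result.append(data)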


st = MLStripper()
st.feed('''
<p>Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. <span class="inserted">
Lorem ipsum</span> Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum.
<span class="strikethrough">Lorem ipsum</span> lorem
<span class="inserted">ipsum</span>. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.</p>
''')

print(''.join(st._result))

The result:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua.
Lorem ipsum Lorem ipsum dolor sit amet, consetetur
sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum.
 lorem
ipsum. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.
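The output keeps the line breaks and stray spaces that surrounded the removed tags. If that matters, one possible clean-up is to collapse all whitespace runs afterwards. This is only a sketch: normalize_whitespace is a hypothetical helper name, and the sample string is a shortened stand-in for the real output.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return re.sub(r'\s+', ' ', text).strip()


# Shortened stand-in for the parser output shown above.
raw = 'sed diam voluptua. \nLorem ipsum Lorem ipsum\n lorem\nipsum. At vero'
print(normalize_whitespace(raw))
```

Note that this also removes the original line wrapping inside the paragraph, so it only fits when the text is going to be re-flowed anyway.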