Simsons - 2 months ago
Python Question

Regular Expression in Python for Removing XML Comments and HTML elements

I am parsing RSS content using Universal Feed Parser. In the description tag I sometimes
get values like the ones below:

<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>


In order to remove the HTML elements/tags, I am using the following regex:

import re
# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r'<\/?\w+\s*[^>]*?\/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)


This removes the HTML tags but not the XML comments. How do I remove both the elements and the XML comments?

Answer

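Using lxml, parse the markup and call text_content(), which returns the text with no markup (comments are skipped):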

import lxml.html as LH

content = '''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

# Parse the fragment, then extract the text with all markup dropped
doc = LH.fromstring(content)
print(doc.text_content())

yields

This is a Test Paragraph
Sample Bold
Sampe Text
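
If adding lxml is not an option, a regex-only variant is possible too. The following is a minimal sketch, not a robust parser: it assumes comments are well-formed (`<!-- ... -->`) and that no attribute value contains a literal `>`. The names `COMMENT_RE`, `TAG_RE` and `strip_markup` are made up for this example; the tag pattern is the one from the question.

```python
import re

# Matches an XML/HTML comment; DOTALL lets '.' span line breaks inside it.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# The tag pattern from the question, written as a raw string.
TAG_RE = re.compile(r'</?\w+\s*[^>]*?/?>', re.DOTALL | re.UNICODE)


def strip_markup(text):
    """Remove comments first, then tags, then collapse leftover whitespace."""
    text = COMMENT_RE.sub(' ', text)  # comments go first, so tags inside them vanish too
    text = TAG_RE.sub(' ', text)
    return ' '.join(text.split())


desc = '''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>
'''
print(strip_markup(desc))
```

Removing the comments before the tags matters: a comment that itself contains markup would otherwise be only partially stripped. For anything beyond simple feed descriptions (CDATA sections, scripts, malformed markup), a real parser such as lxml is likely the safer choice.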