CL. L - 4 years ago
Python Question

Regex for Removing Duplicate HTML Tags in Python

I would like to reduce

<p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> abcabc </p><p>
</p><p> defdef </p><p>
 </p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p><p>
</p> xyzxyz


to

<p></p> abcabc </p><p>defdef</p><p></p> xyzxyz


I tried:

str.replace('</p><p>+', '</p><p>')
and

re.sub('</p><p>+', '</p><p>', str)


Neither worked. Any advice on how to do this? Many thanks.

Answer Source

Alternative approach: you can solve it with an HTML parser, like BeautifulSoup. The idea is to find all p elements except the first one and remove them from the tree:

In [1]: from bs4 import BeautifulSoup

In [2]: data = "<p></p><p></p><p></p><p></p>"

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: for p in soup('p')[1:]:
   ...:     p.decompose()   

In [5]: print(soup)
<p></p>

Or, you can find the first p element and remove all of its following p siblings:

In [6]: soup = BeautifulSoup(data, "html.parser")

In [7]: for p in soup.p.find_next_siblings('p'):
   ...:     p.decompose()  

In [8]: print(soup)
<p></p>

Updated solution for the updated problem (removing p elements whose text is empty or whitespace-only):

In [10]: data = """<p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p> abcabc </p><p>
    ...: </p><p> defdef </p><p>
    ...:  </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p><p>
    ...: </p> xyzxyz"""

In [11]: soup = BeautifulSoup(data, "html.parser")

In [12]: for p in soup.find_all("p", text=lambda text: not text.strip()):
    ...:     p.decompose()
    ...:     

In [13]: print(soup)
 abcabc <p> defdef </p> xyzxyz
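
If you want this outside an interactive session, here is a minimal sketch of the same idea as a function. The name drop_empty_paragraphs is just an illustration, not part of BeautifulSoup. It differs from the session above in two ways that are my own assumptions about what you want: it checks get_text() instead of passing a text filter, so a bare <p></p> (which has no string at all) and a p that wraps only whitespace inside child tags are handled too, and it calls extract() rather than decompose() so that a nested p is still safe to inspect after its parent has been detached.

```python
from bs4 import BeautifulSoup


def drop_empty_paragraphs(html):
    """Return html with every <p> that contains only whitespace removed."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        # get_text() joins every nested string, so this is also true for
        # <p></p> and for a <p> whose children hold nothing but whitespace.
        if not p.get_text().strip():
            # extract() only detaches the tag from the tree; the object
            # itself stays usable if it shows up again later in the loop.
            p.extract()
    return str(soup)


if __name__ == "__main__":
    print(drop_empty_paragraphs("<p>\n</p><p> defdef </p><p></p> xyz"))
```

Note that a p holding only an image or another non-text element also counts as empty here, so adjust the condition if you need to keep those.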
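
As for the regex route the question asked about: str.replace() does literal matching only, so the + is treated as a plain character, and in re.sub('</p><p>+', ...) the + applies only to the final >, while nothing in the pattern allows for the newlines between the tags. A rough sketch that collapses each run of empty paragraphs into a single <p></p> could look like the following. The function name is hypothetical, and the pattern assumes bare lowercase <p> tags with no attributes, so treat it as a quick fix for input you control rather than a general HTML cleaner.

```python
import re

# One empty paragraph, followed by any number of further empty paragraphs
# that are separated from it only by whitespace.
EMPTY_P_RUN = re.compile(r"<p>\s*</p>(?:\s*<p>\s*</p>)*")


def collapse_empty_paragraphs(html):
    """Replace each run of whitespace-only <p> elements with one <p></p>."""
    # The (?: ... )* group repeats the whole "<p> ... </p>" unit, unlike
    # '</p><p>+', where the '+' repeats only the last '>'.
    return EMPTY_P_RUN.sub("<p></p>", html)


if __name__ == "__main__":
    sample = "<p>\n</p><p>\n</p> abc <p> def </p><p> </p><p>\n</p> xyz"
    print(collapse_empty_paragraphs(sample))
```

Stray closing tags such as the extra </p> after abcabc in the question's input are left untouched by this pattern, which is one more reason a parser is usually the safer choice.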