Basj Basj - 3 years ago 161
HTML Question

Clean email HTML body with too many tags

I would like to clean up the HTML bodies of a lot of emails, which are a bit dirty (taken from emails sent with Gmail): there are lots of nested <div> tags, unwanted font changes, etc. I would like to clean this up and keep only <a>, <b>, <br>, <i> and <img>, and nothing else (and maybe also <p> or a few <div> if and only if it's really necessary).

With the regex /<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g, it works most of the time:



document.onclick = function() {
    document.body.innerHTML = document.body.innerHTML.replace(/<\/?(?!(a|br|b|img)\b)\w+[^>]*>/g, '');
}

<div dir="ltr"><div class="gmail_quote"><div dir="ltr">Hello,<div><br></div><div><div><div style="font-size:12.8px"><span style="font-size:12.8px">Thank you for your message.</span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">If the L<span class="m_-527331299899979m_70391001927gmail-il">orem</span>i</span><span class="m_-527331299899979m_703910001927gmail-m_2466414472930393055gmail-il" style="font-size:12.8px">psum</span><span style="font-size:12.8px"> bla bla </span><a href="http://example.com" style="font-size:12.8px" target="_blank">test</a><span style="font-size:12.8px"> window, then it will be like this.</span><br></div><div style="font-size:12.8px">Blah blah.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Lorem ipsum<span style="font-size:12.8px">lorem ipsum </span><span style="font-size:12.8px">blah blah and</span><span style="font-size:12.8px"> you can </span><span style="font-size:12.8px">also <i>blah blah</i> and finally <i>Blah</i>.</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">-----------</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">Examples:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test1</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test2</a></span></div><div><br></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test3</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div></div><div><br></div><div><span 
style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test5</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">ex<wbr>ample</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">exam<wbr>ple</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><br></div></div></div><div class="gmail_extra" style="font-size:12.8px"><div class="m_-52733129979m_703911927gmail-m_24664144055gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><span style="font-size:small">Sincerly,</span><br></div></div></div></div></div></div></div></div><div><div><div class="m_-52722719979m_7039100982345401927gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br></div><div>Myself<br></div><div dir="ltr"><br><b>example</b><br>web: <a href="http://www.example.com" target="_blank">www.example.com</a><br></div><div>fb: <a href="http://www.facebook.com/example/" target="_blank">www.facebook.com/LoremIp<wbr>sum/</a><br></div><div>mail: <a href="mailto:contact@example.com" target="_blank">contact@example.com</a><br></div><div dir="ltr"><br><img src="http://example.com/example.png"><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div>





(Click anywhere in the email after having run the Code Snippet to see what happens after the regex is applied)

Indeed:


  • useless <span> and </span> tags are successfully removed

  • <div style="..."> and </div> are removed



But there is a remaining problem when removing <div> like this:


  • Empty lines are removed (see the empty lines between lines 1 and 3 of the mail output, between lines 3 and 5, etc.)

  • The newline after each "example: test1" line is removed (see when you run the Code Snippet)



I tried replacing <div.*?><br></div> with <br><br>, but it's still not correct.

Question: How can I clean this HTML code, discard the unwanted font changes, etc., keep the same empty lines, and keep the <a>, <b>, <br>, <i> and <img> tags?


Note: it has to finally run in a Google Apps Script, so I'm not sure it's possible to import third-party JS libraries...

Answer Source

The following 5-step process works for the sample you provided:

  1. In the first pass, keep the div tags but remove all other unwanted tags.
  2. Replace <div><br></div> with <br><br>.
  3. Replace any sequence of one or more closing </div> tags, each possibly preceded by <br>, with a single <br>.
  4. Remove all remaining div tags.
  5. Replace any sequence of 2 or more <br> tags with two <br> tags.

Code:

document.onclick = function() {
    document.body.innerHTML = document.body.innerHTML
                              .replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '')
                              .replace(/<div[^>]*><br><\/div>/g, '<br><br>')
                              .replace(/((<br>)?<\/div>)+/g, '<br>')
                              .replace(/<div[^>]*>/g, '')
                              .replace(/(<br>){2,}/g, '<br><br>');
}
<div dir="ltr"><div class="gmail_quote"><div dir="ltr">Hello,<div><br></div><div><div><div style="font-size:12.8px"><span style="font-size:12.8px">Thank you for your message.</span><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><span style="font-size:12.8px">If the L<span class="m_-527331299899979m_70391001927gmail-il">orem</span>i</span><span class="m_-527331299899979m_703910001927gmail-m_2466414472930393055gmail-il" style="font-size:12.8px">psum</span><span style="font-size:12.8px"> bla bla </span><a href="http://example.com" style="font-size:12.8px" target="_blank">test</a><span style="font-size:12.8px"> window, then it will be like this.</span><br></div><div style="font-size:12.8px">Blah blah.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Lorem ipsum<span style="font-size:12.8px">lorem ipsum </span><span style="font-size:12.8px">blah blah and</span><span style="font-size:12.8px"> you can </span><span style="font-size:12.8px">also <i>blah blah</i> and finally <i>Blah</i>.</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">-----------</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div style="font-size:12.8px"><span style="font-size:12.8px">Examples:</span></div><div style="font-size:12.8px"><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test1</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test2</a></span></div><div><br></div><div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test3</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div></div><div><br></div><div><span 
style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test4</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">test5</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">ex<wbr>ample</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">example</a></span></div><div><span style="font-size:12.8px">example: <a href="http://example.com" target="_blank">exam<wbr>ple</a></span></div><div><span style="font-size:12.8px"><br></span></div><div><br></div></div></div><div class="gmail_extra" style="font-size:12.8px"><div class="m_-52733129979m_703911927gmail-m_24664144055gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><span style="font-size:small">Sincerly,</span><br></div></div></div></div></div></div></div></div><div><div><div class="m_-52722719979m_7039100982345401927gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br></div><div>Myself<br></div><div dir="ltr"><br><b>example</b><br>web: <a href="http://www.example.com" target="_blank">www.example.com</a><br></div><div>fb: <a href="http://www.facebook.com/example/" target="_blank">www.facebook.com/LoremIp<wbr>sum/</a><br></div><div>mail: <a href="mailto:contact@example.com" target="_blank">contact@example.com</a><br></div><div dir="ltr"><br><img src="http://example.com/example.png"><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div>
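
Since the question mentions Google Apps Script, where there is no document.body to work on, the same five replacements can be wrapped in a function that takes and returns a plain string. This is only a sketch: the name cleanEmailHtml and the shortened sample string are made up for illustration, and it assumes the input uses <br> exactly as in the Gmail sample (not <br/> or <br />).

```javascript
// Hypothetical helper (name not from the original answer): applies the
// five steps to an HTML string using only String.prototype.replace.
function cleanEmailHtml(html) {
  return html
    // 1. Strip every tag except a, br, b, i, img and (for now) div.
    .replace(/<\/?(?!(a|br|b|i|img|div)\b)\w+[^>]*>/g, '')
    // 2. A div containing only <br> stands for an empty line.
    .replace(/<div[^>]*><br><\/div>/g, '<br><br>')
    // 3. A run of closing divs, each optionally preceded by <br>, ends one line.
    .replace(/((<br>)?<\/div>)+/g, '<br>')
    // 4. Drop the remaining (opening) div tags.
    .replace(/<div[^>]*>/g, '')
    // 5. Collapse runs of two or more <br> into exactly two.
    .replace(/(<br>){2,}/g, '<br><br>');
}

// Shortened, made-up sample in the style of the Gmail markup above.
var sample =
  '<div dir="ltr">Hello,<div><br></div>' +
  '<div style="font-size:12.8px"><span style="font-size:12.8px">Thank you.</span><br></div>' +
  '<div>Bye <a href="http://example.com">link</a></div></div>';

var cleaned = cleanEmailHtml(sample);
// cleaned is:
// Hello,<br><br>Thank you.<br>Bye <a href="http://example.com">link</a><br>
```

Because it only relies on regular expressions and string methods, no third-party library is needed. Note that attributes on the kept tags (such as style or target on <a>) are left untouched by these steps.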
