ksprague - 4 months ago
HTML Question

Using C# to remove duplicate HTML span elements

I have to convert Word documents to HTML, which I'm doing with Aspose, and that is working well. The problem is that the conversion produces some redundant elements, which I think is due to the way the text is stored in Word.

For example, my Word document contains the text below:

AUTHORIZATION FOR RELEASE

When converted to HTML, it becomes:

<span style="font-size:9pt">A</span>
<span style="font-size:9pt">UTHORIZATION FOR R</span>
<span style="font-size:9pt">ELEASE</span>


I'm using C# and would like a way to remove the redundant span elements. I think either AngleSharp or html-agility-pack should be able to do this, but I'm not sure whether that is the best approach.

Answer

What I wound up doing is iterating over all the elements and, whenever adjacent span elements with matching attributes were detected, concatenating their text into the first span and removing the second. Here is some code in case others run into this issue. Note that the code could use some cleanup.

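// Requires the AngleSharp package. IHtmlSpanElement lives in AngleSharp.Html.Dom
// as of AngleSharp 0.10; in 0.9.x the namespace was AngleSharp.Dom.Html.
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;

// Merges runs of adjacent <span> elements that carry identical attributes,
// then recurses into the remaining children.
// Note: Children lists elements only, so text nodes sitting between two spans
// are not taken into account here.
static void CombineRedundantSpans(IElement parent)
{
  if (parent == null)
    return;

  // Snapshot the children: parent.Children is live and shrinks as spans are removed.
  var children = parent.Children.ToArray();
  if (children.Length > 1)
  {
    var previousSibling = children[0];
    for (int i = 1; i < children.Length; i++)
    {
      var current = children[i];
      if (previousSibling is IHtmlSpanElement previousSpan &&
          current is IHtmlSpanElement currentSpan &&
          IsSpanMatch(previousSpan, currentSpan))
      {
        // Fold the text into the earlier span and drop the later one.
        previousSibling.TextContent += current.TextContent;
        current.Remove();
      }
      else
      {
        previousSibling = current;
      }
    }
  }

  foreach (var child in parent.Children.ToArray())
  {
    CombineRedundantSpans(child);
  }
}

// Two spans match when they hold only text and have the same attributes with the same values.
static bool IsSpanMatch(IHtmlSpanElement first, IHtmlSpanElement second)
{
  // Setting TextContent replaces all child nodes, so spans with child elements are left alone.
  if (first.ChildElementCount > 0 || second.ChildElementCount > 0)
    return false;

  if (first.Attributes.Length != second.Attributes.Length)
    return false;

  // Same attribute count, and every attribute of the first has the same value on the second.
  return first.Attributes.All(a => second.GetAttribute(a.Name) == a.Value);
}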
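
For anyone who wants to try the idea without pulling in a parser package, here is a minimal, self-contained sketch of the same merge using only System.Xml.Linq from the base class library. This is not the AngleSharp code above: it assumes the converted fragment is well-formed XML, and the SpanMerger class and its Merge helper are hypothetical names made up for the illustration. Unlike the Children-based loop, it looks at NextNode, so two spans separated by text are never merged.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class SpanMerger
{
    // True when both elements are <span>s that hold only text and carry the same attributes.
    static bool IsSpanMatch(XElement first, XElement second)
    {
        if (first.Name != "span" || second.Name != "span") return false;
        if (first.HasElements || second.HasElements) return false;
        var a = first.Attributes().ToList();
        var b = second.Attributes().ToList();
        return a.Count == b.Count
            && a.All(x => (string)second.Attribute(x.Name) == x.Value);
    }

    // Merges each run of directly adjacent matching spans, then recurses into child elements.
    public static void CombineRedundantSpans(XElement parent)
    {
        var current = parent.Elements().FirstOrDefault();
        while (current != null)
        {
            // NextNode is the very next node, so text between two spans blocks the merge.
            if (current.NextNode is XElement next && IsSpanMatch(current, next))
            {
                current.Value += next.Value;
                next.Remove();
            }
            else
            {
                current = current.ElementsAfterSelf().FirstOrDefault();
            }
        }
        foreach (var child in parent.Elements().ToList())
            CombineRedundantSpans(child);
    }

    // Parses a well-formed fragment, merges redundant spans, and returns the markup.
    public static string Merge(string xml)
    {
        var root = XElement.Parse(xml);
        CombineRedundantSpans(root);
        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        Console.WriteLine(Merge(
            "<p><span style=\"font-size:9pt\">A</span>" +
            "<span style=\"font-size:9pt\">UTHORIZATION FOR R</span>" +
            "<span style=\"font-size:9pt\">ELEASE</span></p>"));
        // prints <p><span style="font-size:9pt">AUTHORIZATION FOR RELEASE</span></p>
    }
}
```

Real Word-to-HTML output is often not well-formed XML, in which case an HTML parser such as AngleSharp or html-agility-pack remains the safer choice.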