The Dag The Dag - 1 month ago 7
C# Question

Mysterious failure removing nodes from an XML document

I'd be surprised if anyone can explain this, but it'd be interesting to know if others can reproduce the weirdness I'm experiencing...

We've got a thing based on InfoPath that processes a lot of forms. Form data should conform to an XSD, but InfoPath keeps adding its own metadata in the form of so-called "my-fields". We would like to remove the my-fields, and I wrote this simple method:

string StripMyFields(string xml)
{
var doc = new XmlDocument();
doc.LoadXml(xml);

var matches = doc.SelectNodes("//node()").Cast<XmlNode>().Where(n => n.NamespaceURI.StartsWith("http://schemas.microsoft.com/office/infopath/"));
Dbug("Found {0} nodes to remove.", matches.Count());
foreach (var m in matches)
m.ParentNode.RemoveChild(m);

return doc.OuterXml;
}


Now comes the really weird stuff! When I run this code it behaves as I expect it to, removing any nodes that are in InfoPath namespaces. However, if I comment out the call to Dbug, the code completes, but one "my-field" remains in the XML.

I've even commented out the content of the convenient Dbug method, and it still behaves this same way:

void Dbug(string s, params object[] args)
{
//if (args.Length > 0)
// s = string.Format(s, args);
//Debug.WriteLine(s);
}


Input XML:

<?xml version="1.0" encoding="UTF-8"?>
<skjema xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2008-03-03T22:25:25" xml:lang="en-us">
<Field-1643 orid="1643">data.</Field-1643>
<my:myFields>
<my:field1>Al</my:field1>
<my:group1>
<my:group2>
<my:field2 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2009-01-01</my:field2>
<Field-1611 orid="1611">More data.</Field-1611>
<my:field3>true</my:field3>
</my:group2>
<my:group2>
<my:field2>2009-01-31</my:field2>
<my:field3>false</my:field3>
</my:group2>
</my:group1>
</my:myFields>
<Field-1612 orid="1612">Even more data.</Field-1612>
<my:field3>Blah blah</my:field3>
</skjema>


The "my:field3" element (at the bottom, text "Blah blah") is not removed unless I invoke Dbug.

Clearly the universe is not supposed to be like this, but I would be interested to know if others are able to reproduce.

I'm using VS2012 Premium (11.0.50727.1 RTMREL) and FW 4.5.50709 on Win8 Enterprise 6.2.9200.

Answer

First things first. LINQ uses concept known as deferred execution. This means no results are fetched until you actually materialize query (for example via enumeration).

Why would it matter with your nodes removal issue? Let's see what happens in your code:

  1. SelectNodes creates XPathNodeIterator, which is used by XPathNavigator which feeds data to XmlNodeList returned by SelectNodes
  2. XPathNodeIterator walks xml document tree basing on XPath expression provided
  3. The Cast and Where simply decide whether node returned by XPathNodeIterator should participate in final result

We arrive right before DBug method call. For a moment, assume it's not there. At this point, nothing have actually happened just yet. We only got unmaterialized LINQ query.

Things change when we start iterating. All the iterators (Cast and Where got their own iterators too) start rolling. WhereIterator asks CastIterator for item, which then asks XPathNodeIterator which finally returns first node (Field-1643). Unfortunately, this one fails the Where test, so we ask for next one. More luck with my:myFields, it is a match - we remove it.

We quickly proceed to my:field1 (again, WhereIteratorCastIteratorXPathNodeIterator), which is also removed. Stop here for a moment. Removing my:field1 detaches it from its parent, which results in setting its (my:field1) siblings to null (there's no other nodes before/after removed node).

What's the current state of things? XPathNodeIterator knows its current element is my:field1 node, which just got removed. Removed as in detached from parent, but iterator still holds reference. Sounds great, let's ask it for next node. What XPathNodeIterator does? Checks its Current item, and asks for NextSibling (since it has no children to walk first) - which is null, given we just performed detachment. And this means iteration is over. Job done.

As a result, by altering collection structure during iteration, you only removed two nodes from your document (while in reality only one, as the second removed node was child of the one already removed).

Same behavior can be observed with much simpler XML:

<Root>
    <James>Bond</James>
    <Jason>Bourne</Jason>
    <Jimmy>Keen</Jimmy>
    <Tom />
    <Bob />
</Root>

Suppose we want to get rid of nodes starting with J, resulting in document containing only honest man names:

var doc = new XmlDocument();
doc.LoadXml(xml);

var matches = doc
    .SelectNodes("//node()")
    .Cast<XmlNode>()
    .Where(n => n.Name.StartsWith("J"));

foreach (var node in matches)
{
    node.ParentNode.RemoveChild(node);
}

Console.WriteLine(doc.InnerXml);

Unfortunately, Jason and Jimmy remain. James' next sibling (the one to be returned by iterator) was originally meant to be Jason, but as soon as we detached James from tree there's no siblings and iteration ends.

Now, why it works with DBug? Count call materializes query. Iterators have run, we got access to all nodes we need when we start looping. The same things happens with ToList called right after Where or if you inspect results during debug (VS even notifies you inspecting results will enumerate collection).