d.moncada - 4 years ago
C# Question

UTF-8 Byte Mark check gives different value based on operating system

We have some unit tests that check for the UTF-8 byte order mark at the start of an XML string before it's loaded into an XmlDocument. Everything works fine on Windows 7 64-bit, but we noticed a bunch of tests failing when run under Windows 10 64-bit.

After a bit of investigation, we found that the XML string is getting pruned on Windows 10 (the check reports that the preamble exists), while on Windows 7 it is not.

Here is the code snippet:

public static string PruneUtf8ByteMark(string xmlString)
{
    var byteOrderMarking = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    if (xmlString.StartsWith(byteOrderMarking))
    {
        xmlString = xmlString.Remove(0, byteOrderMarking.Length);
    }

    return xmlString;
}


StartsWith is returning true on Windows 10 and false on Windows 7. Note that the same XML string is being used; the only difference here is the OS.

Any ideas? We are a bit lost here, since both PCs are x64 running the same .NET version.

edit:
The string comes from a class via:

public static string XmlString = "<?xml version=\"1.0\"....


On Windows 10, the leading less-than sign gets removed because the byte order mark check returns true.

Answer Source

The problem is caused by culture-sensitive comparison.

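The byteOrderMarking string is a single zero-width character (U+FEFF), which a culture-sensitive comparison can treat as ignorable. Culture-sensitive comparison relies on collation data that can differ between Windows versions, which would explain why the two machines disagree.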
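To make it visible what the check is actually comparing against, here is a small sketch that decodes the UTF-8 preamble and prints its code units. The class BomInspection and its two methods are hypothetical names made up for this illustration; only the Encoding calls come from the question.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of the original question.
public static class BomInspection
{
    // Decodes the UTF-8 preamble (bytes EF BB BF) into a string,
    // exactly as the question's code does.
    public static string GetBomString()
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    }

    // Formats each UTF-16 code unit of a string as U+XXXX, separated by spaces.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

Calling BomInspection.Describe(BomInspection.GetBomString()) should return "U+FEFF": the string has a length of 1, but nothing is displayed when it is printed.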

Consider the following cases:

"".StartsWith("") // = true
"aa".StartsWith("") // = true 
"aa".StartsWith("", StringComparison.Ordinal) // = true

So every string starts with an empty string. Now with byteOrderMarking:

var byteOrderMarking = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
byteOrderMarking.Equals("") // = False
byteOrderMarking.Equals("", StringComparison.CurrentCulture) // = True
byteOrderMarking.Equals("", StringComparison.Ordinal) // = False

Now we can see that byteOrderMarking is equal to an empty string only under the current-culture comparison. When you check whether a string starts with byteOrderMarking using that comparison, it is like checking whether it starts with an empty string.

The difference between Ordinal and CurrentCulture is that the first compares the strings code unit by code unit, whereas the second applies the linguistic rules of the current culture, which can ignore certain characters.

Lastly, I suggest always using Ordinal (or OrdinalIgnoreCase) when comparing technical strings.
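Applied to the method from the question, that suggestion could look like the sketch below. The wrapper class name XmlStringHelper and the null check are my own additions; the rest follows the original code, with only the comparison type made explicit.

```csharp
using System;
using System.Text;

// XmlStringHelper is a hypothetical container class for this sketch.
public static class XmlStringHelper
{
    // Same logic as the question's PruneUtf8ByteMark, but with an ordinal
    // comparison so the result does not depend on culture or OS collation data.
    public static string PruneUtf8ByteMark(string xmlString)
    {
        // Added for this sketch: fail early on null input.
        if (xmlString == null) throw new ArgumentNullException("xmlString");

        var byteOrderMarking = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

        // Ordinal: true only if the string really begins with U+FEFF.
        if (xmlString.StartsWith(byteOrderMarking, StringComparison.Ordinal))
        {
            xmlString = xmlString.Remove(0, byteOrderMarking.Length);
        }

        return xmlString;
    }
}
```

With this version, a string that begins with "<?xml" should come back unchanged, and only a string that actually starts with the U+FEFF character gets that one character removed.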
