André André - 1 month ago 7x
C# Question

Remove 4-byte UTF-8 characters

I'd like to remove 4-byte UTF-8 characters which start with \xF0 (the char with the ASCII code 0xF0) from a string, and tried:

sText = Regex.Replace (sText, "\xF0...", "");

This doesn't work. Using two backslashes did not work either.

The exact input is the content of [...]. The 4-byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is: .. 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..

What's wrong?


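The regex cannot match because a .NET string holds UTF-16 code units, not UTF-8 bytes: "\xF0" in the pattern means the character U+00F0 (ð), not the byte 0xF0. Characters that take four bytes in UTF-8 are stored as surrogate pairs in .NET, that is, each of them is two UTF-16 code units (two char values).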
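To make this concrete, here is a small sketch using the character from the question (F0 9D 84 9E in UTF-8 decodes to U+1D11E). The class name SurrogateDemo and the constant Clef are only illustrative names, not part of any API.

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // U+1D11E (musical symbol G clef); its UTF-8 encoding is F0 9D 84 9E.
    public const string Clef = "\U0001D11E";

    public static void Main()
    {
        // One character for the reader, but two UTF-16 code units in the string.
        Console.WriteLine(Clef.Length);                       // 2
        Console.WriteLine(char.IsSurrogatePair(Clef, 0));     // True

        // The four bytes only exist once the string is encoded as UTF-8.
        Console.WriteLine(Encoding.UTF8.GetByteCount(Clef));  // 4

        // '\xF0' is the single char U+00F0, which is not part of the pair.
        Console.WriteLine(Clef.IndexOf('\xF0'));              // -1
    }
}
```

Since the char U+00F0 never occurs in the string, a pattern that begins with \xF0 has nothing to match.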

To simply remove them, you can write (this needs using System.Linq;):

sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));

(This uses an overload of string.Concat introduced in .NET 4.0 (Visual Studio 2010).)
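As a rough end-to-end check, the Concat line can be wrapped in a helper and run against the byte sequence quoted in the question. The names SurrogateFilter and RemoveSurrogates are made up for this sketch, and it assumes the input has already been decoded from UTF-8 into a string.

```csharp
using System;
using System.Linq;
using System.Text;

static class SurrogateFilter
{
    // Drops every UTF-16 surrogate code unit, which removes all characters
    // outside the Basic Multilingual Plane (the ones that need 4 bytes in UTF-8).
    public static string RemoveSurrogates(string text)
    {
        return string.Concat(text.Where(c => !char.IsSurrogate(c)));
    }

    public static void Main()
    {
        // Byte sequence quoted in the question: "el]] ", U+1D11E, " ".
        byte[] utf8 = { 0x65, 0x6c, 0x5d, 0x5d, 0x20, 0xf0, 0x9d, 0x84, 0x9e, 0x20 };
        string sText = Encoding.UTF8.GetString(utf8);

        string cleaned = RemoveSurrogates(sText);

        // Re-encode to compare with the expected bytes from the question.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(cleaned)));
        // prints 65-6C-5D-5D-20-20
    }
}
```

Note that this filter also drops unpaired surrogates, should the input contain any.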

Late addition: It may give better performance to use:

sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());

even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
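Since the question started from Regex.Replace, a regex variant is also possible, as long as the pattern targets UTF-16 surrogates instead of \xF0. This is a different technique from the LINQ filter, shown here only as a sketch; SurrogateRegex and RemovePairs are illustrative names. It removes well-formed pairs (high surrogate followed by low surrogate) and leaves any lone surrogate in place.

```csharp
using System;
using System.Text.RegularExpressions;

static class SurrogateRegex
{
    // One high surrogate (U+D800..U+DBFF) immediately followed by
    // one low surrogate (U+DC00..U+DFFF).
    static readonly Regex Pair = new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");

    public static string RemovePairs(string text)
    {
        return Pair.Replace(text, "");
    }

    public static void Main()
    {
        string input = "el]] \U0001D11E ";   // 8 UTF-16 code units
        Console.WriteLine(RemovePairs(input).Length); // 6
    }
}
```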