I am trying to find the index of "Mauricio" in a string that is downloaded from a website using WebClient and DownloadString. However, the website spells it with a foreign character, Maurício. So I found some code elsewhere:
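// Decompose to FormD so accents become separate combining marks, then drop the marks.
string ToASCII(string s) =>
    string.Concat(s.Normalize(NormalizationForm.FormD)
        .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));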
wc.Encoding = System.Text.Encoding.UTF8;
DownloadString doesn't look at HTTP response headers. It uses the previously set WebClient.Encoding property. If you have to use it, get the headers first:
// call twice
// (or to just do a HEAD, see http://stackoverflow.com/questions/3268926/head-with-webclient)
webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
var contentType = webClient.ResponseHeaders["Content-Type"];
// Groups[1] is the first capture group, i.e. the charset name itself
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
webClient.Encoding = Encoding.GetEncoding(charset);
var s = webClient.DownloadString("http://en.wikipedia.org/wiki/Maurício");
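To see why the encoding has to be right before something like ToASCII can help, here is a rough, self-contained sketch. The `Demo` class, the `RemoveDiacritics` name, and the sample bytes are made up for illustration; a real body would come from the download. It decodes the same UTF-8 bytes twice, once as UTF-8 and once as ISO-8859-1, and then searches for "Mauricio" after stripping combining marks.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class Demo
{
    // Hypothetical helper: canonical decomposition (FormD) splits "í" into
    // "i" + U+0301, and the filter then drops the combining mark.
    public static string RemoveDiacritics(string s) =>
        string.Concat(s.Normalize(NormalizationForm.FormD)
            .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));

    public static void Main()
    {
        // Stand-in for a response body that the server sent as UTF-8.
        byte[] body = Encoding.UTF8.GetBytes("Name: Maurício");

        // Decoded with the charset the server actually used.
        string right = Encoding.UTF8.GetString(body);
        // Decoded with the wrong charset: "í" becomes "Ã" plus a soft hyphen.
        string wrong = Encoding.GetEncoding("ISO-8859-1").GetString(body);

        Console.WriteLine(RemoveDiacritics(right).IndexOf("Mauricio", StringComparison.Ordinal)); // 6
        Console.WriteLine(RemoveDiacritics(wrong).IndexOf("Mauricio", StringComparison.Ordinal)); // -1
    }
}
```

With the wrong decoding, the two bytes of "í" turn into unrelated characters, so no amount of accent stripping brings "Mauricio" back. That is why the charset needs to be settled before the search.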
BTW--Unicode doesn't define "foreign" characters. From Maurício's perspective, "Mauricio" would be the foreign spelling of his name.