I have an issue with content that we download from the web for a screen-scraping tool I am building.
In the code below, the string returned by the WebClient.DownloadString method contains some odd characters in the downloaded source for a few (not all) web sites.
I recently added HTTP headers, as shown below. Previously the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding beyond the basics.
The characters, or character sequences, that I am referring to include: ï»¿
These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?
string urlData = String.Empty;
WebClient wc = new WebClient();
// Add headers to impersonate a web browser. Some web sites
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
urlData = wc.DownloadString(uri);
ï»¿ is the windows-1252 representation of the octets EF BB BF.
That is the UTF-8 byte-order mark, which implies that the remote web page is encoded in UTF-8 but you are reading it as if it were windows-1252. According to the docs,
WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to
System.Text.Encoding.UTF8 and things should theoretically work.
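
A minimal sketch of both the symptom and the suggested fix, under a few assumptions: the class name EncodingDemo and the helpers Decode and DownloadUtf8 are hypothetical names made up for illustration, and the shortened User-Agent value is a placeholder. Because windows-1252 is not registered by default on every .NET runtime, the sketch decodes with ISO-8859-1 instead, which maps the three bytes EF BB BF to the same characters as windows-1252.

```csharp
using System;
using System.Net;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decodes a raw response body with the given
    // encoding, standing in for the bytes-to-string step of a download.
    public static string Decode(byte[] body, Encoding encoding)
    {
        return encoding.GetString(body);
    }

    // Hypothetical helper: the suggested fix applied to the question's code.
    // Setting Encoding before the call tells WebClient to decode as UTF-8.
    public static string DownloadUtf8(string uri)
    {
        using (WebClient wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8;
            wc.Headers.Add("User-Agent", "Mozilla/5.0"); // placeholder value
            return wc.DownloadString(uri);
        }
    }

    public static void Main()
    {
        // UTF-8 byte-order mark (EF BB BF) followed by the ASCII text "<html>".
        byte[] body = { 0xEF, 0xBB, 0xBF, 0x3C, 0x68, 0x74, 0x6D, 0x6C, 0x3E };

        // Wrong decoder: each BOM byte becomes its own character
        // (U+00EF, U+00BB, U+00BF), which is the sequence from the question.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string wrong = Decode(body, latin1);

        // Right decoder: the BOM collapses to a single U+FEFF, which
        // Encoding.GetString leaves in place, so it is trimmed explicitly.
        string right = Decode(body, Encoding.UTF8).TrimStart('\uFEFF');

        Console.WriteLine(wrong.Length); // 9 characters: 3 stray + "<html>"
        Console.WriteLine(right.Length); // 6 characters: "<html>"
    }
}
```

If a site serves something other than UTF-8, hard-coding the encoding would trade one kind of garbling for another, so it may be worth checking the charset in the response's Content-Type header before choosing the decoder.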