J4N J4N - 2 months ago 20
C# Question

Retrieve web page content like a browser

After learning a bit about different technologies, I wanted to make a small project using UWP+NoSQL: a small UWP app that grabs the horoscope and displays it on my Raspberry Pi every morning.

So I took a WebClient and did the following:

WebClient client = new WebClient();
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");


But it seems that the site detects that this request isn't coming from a browser, since the interesting part is not in the content (and when I check with the browser, it is in the initial HTML, according to Fiddler).

I also tried with ScrapySharp but I got the same result. Any idea why?

(I've already done the UWP part, so I don't want to change the topic of my personal project just because it is detected as a "bot")

EDIT

It seems I wasn't clear enough. The issue is **not** that I'm unable to parse the HTML; the issue is that I don't receive the expected HTML when using ScrapySharp/WebClient.

EDIT2

Here is what I retrieve: http://pastebin.com/sXi4JJRG

For example, I don't get the "Star ratings by domain" section or the related images for each star.

Answer

Ok, I think I know what's going on: I compared the real output (no fancy user agent strings) to the output as supplied by your pastebin and found something interesting. On line 213, your pastebin has:

<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hov...ck">Forecast Tarot Readings</div>

Mind the data-hov...ck near the end. In the real output, this was:

<li class="dropdown"><a href="/us/profiles/zodiac/index-profile-zodiac-sign.aspx" class="dropdown-toggle" data-hover="dropdown" data-toggle="link">Astrology</a>

followed by about 600 lines of code, including the aforementioned 'interesting part'. On line 814, it says:

<div class="bot-explore-col-subtitle f14 blocksubtitle black">Forecast Tarot Readings</div>

which, starting with the ck in black, matches up with the rest of the pastebin output. So either pastebin condensed the output, or the output was already condensed before it was pasted.
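The comparison described above was done by eye, but it could also be automated. Here is a minimal sketch, assuming both versions of the page are already available as strings; `HtmlCompare`, `FirstDifference` and `Context` are hypothetical names made up for this illustration, not part of any library.

```csharp
using System;

public static class HtmlCompare
{
    // Returns the index of the first character that differs,
    // or -1 if both strings are identical.
    public static int FirstDifference(string expected, string actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }
        // One string is a prefix of the other: they differ where the shorter one ends.
        return expected.Length == actual.Length ? -1 : shared;
    }

    // Returns a short window of text around a position, clamped to the string bounds.
    public static string Context(string text, int index, int radius = 40)
    {
        int start = Math.Max(0, Math.Min(text.Length, index - radius));
        int end = Math.Max(start, Math.Min(text.Length, index + radius));
        return text.Substring(start, end - start);
    }
}
```

Calling `FirstDifference` on the browser's HTML and the downloaded HTML, then printing `Context` for both at that index, should point at the spot where the two outputs start to diverge, such as the `data-hov...ck` fragment mentioned above.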

I created a new console application, inserted your code, and got the result I expected, including the 600 lines of HTML you seem to be missing:

using System.IO;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        // Same request as in the question, run from a plain console application.
        WebClient client = new WebClient();
        client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";
        string downloadString = client.DownloadString("http://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=2");

        // Save the response so it can be compared with what the browser receives.
        File.WriteAllText(@"D:\Temp\source-mywebclient.html", downloadString);
    }
}

My WebClient is from System.Net. Changing the UserAgent hardly has any effect; only a couple of links are a bit different.

So, to sum it up: your problem has nothing to do with content that is inserted dynamically after the initial GET, but possibly with WebClient combined with UWP. There's another question on the site about WebClient and UWP, "(UWP) WebClient and downloading data from URL", which states that you should use HttpClient. Maybe that's a solution?
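If you want to try HttpClient, a rough sketch might look like the following. It assumes the System.Net.Http version of HttpClient (UWP also offers Windows.Web.Http.HttpClient, whose API differs), and `PageFetcher`, `BuildRequest` and `DownloadStringAsync` are hypothetical names chosen for this example. I have not tested it on a Raspberry Pi, so treat it as a starting point rather than a confirmed fix.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Same User-Agent string as in the question.
    private const string UserAgent =
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2";

    // A single HttpClient instance is reused for every request.
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: builds a GET request that carries the User-Agent header.
    public static HttpRequestMessage BuildRequest(string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        // Adds the header as-is, without strict validation of the value.
        request.Headers.TryAddWithoutValidation("User-Agent", UserAgent);
        return request;
    }

    // Hypothetical helper: downloads the page body as a string.
    public static async Task<string> DownloadStringAsync(string url)
    {
        using (HttpRequestMessage request = BuildRequest(url))
        using (HttpResponseMessage response = await Client.SendAsync(request))
        {
            // Throws HttpRequestException for non-success status codes.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

From an async method in the app, `string html = await PageFetcher.DownloadStringAsync(url);` with the horoscope URL from the question would replace the `client.DownloadString(...)` call. Checking whether `html` contains the "Star ratings by domain" text would then show whether the missing part comes through.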
