C# Question

How to download the complete page source, instead of a partial download?

I'm scraping dynamic data from a website. For some reason the PageSource that I get is only partial. However, it is not partial when I view the page source directly in Chrome or Firefox. I would like an answer that will let me scrape the complete data from the page.

For my application, I want to scrape programmatically using a .NET web browser or similar. I've tried Selenium WebDriver 2.48.2 with ChromeDriver; I've also tried PhantomJSDriver, WebClient, and HttpWebRequest. All with .NET 4.6.1.

The url: http://contests.covers.com/KingOfCovers/Contestant/PendingPicks/ARTDB

None of the following are working...

Attempt #1: HttpWebRequest

var urlContent = "";

try
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = new CookieContainer();
    if (cookies != null)
    {
        foreach (Cookie cookie in cookies)
        {
            request.CookieContainer.Add(cookie);
        }
    }

    // Issue the request asynchronously.
    var responseTask = Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);

    using (var response = (HttpWebResponse)await responseTask)
    {
        // Carry any cookies forward for subsequent requests.
        if (response.Cookies != null)
        {
            foreach (Cookie cookie in response.Cookies)
            {
                cookies.Add(cookie);
            }
        }

        // Read the full response body.
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            urlContent = sr.ReadToEnd();
        }
    }
}
catch (WebException ex)
{
    // Handle/log the failure as appropriate.
    Console.WriteLine(ex.Message);
}


Attempt #2: WebClient

// Requires an async method signature
using (WebClient client = new WebClient())
{
    var content = await client.DownloadStringTaskAsync(url);

    return content;
}


Attempt #3: PhantomJSDriver

var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;
using (var driver = new PhantomJSDriver(driverService))
{
driver.Navigate().GoToUrl(url);

WaitForAjax(driver);

string source = driver.PageSource;

return source;
}

public static void WaitForAjax(PhantomJSDriver driver)
{
while (true) // Handle timeout somewhere
{
var ajaxIsComplete = (bool)(driver as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0");
if (ajaxIsComplete)
break;
Thread.Sleep(100);
}
}
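
As a side note, the timeout mentioned in the comment above could be handled with WebDriverWait instead of the hand-rolled loop. This is only a sketch, assuming a 30-second limit; the extra jQuery check guards against pages that never load jQuery:

// Sketch only: WaitForAjax with a timeout via WebDriverWait (OpenQA.Selenium.Support.UI).
// WebDriverWait throws WebDriverTimeoutException if the condition never becomes true.
public static void WaitForAjax(IWebDriver driver, int timeoutSeconds = 30)
{
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds));
    wait.Until(d => (bool)((IJavaScriptExecutor)d).ExecuteScript(
        "return (typeof jQuery === 'undefined') || jQuery.active == 0"));
}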


I also tried ChromeDriver with a page object model. That code is too long to paste here; nonetheless, it has the exact same result as the other three attempts (a stripped-down sketch of the idea follows below).
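
For reference, the core of such an attempt can be sketched as follows. This is not the original page-object code; the 30-second wait and the CSS selector for the table rows are assumptions:

// Hypothetical minimal ChromeDriver version of the same scrape.
using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(url);

    // Wait until the AJAX-loaded rows are present before reading PageSource.
    // "table tbody tr" is a placeholder selector, not taken from the actual site.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
    wait.Until(d => d.FindElements(By.CssSelector("table tbody tr")).Count > 0);

    return driver.PageSource;
}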

Expected Results

The data table from the url is complete, with no missing data. For example, here is a screenshot to compare with the screenshot further below. The thing to observe is that there is NOT a "..."; instead, the data is there. This can be reproduced by opening the url in Firefox or Chrome, right-clicking, and choosing View Page Source.

[Screenshot: expected page source, showing the full rows of data with no "..."]

Actual Results

Observe that where the "..." is there is a big gap, as the arrow in the screenshot indicates. There should be many rows of content in place of that "...". This can be reproduced using any of the attempts above.

[Screenshot: actual page source, with a large gap where the "..." appears]

Please note that the url serves dynamic data, so you will likely not see exactly the same results as the screenshots. Nonetheless, the exercise can be repeated; it will simply look different from the screenshots. A quick test to confirm that data is missing is to compare the page source line counts: the "complete" version has nearly twice as many lines of HTML (a small comparison sketch follows below).
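
Assuming both versions have been saved to disk (the file names below are placeholders), that line-count comparison takes only a couple of lines:

// Compare the browser's "View Page Source" output against the scraped output.
int browserLines = System.IO.File.ReadAllLines(@"browser.html").Length;
int scrapedLines = System.IO.File.ReadAllLines(@"scraped.html").Length;
Console.WriteLine($"Browser: {browserLines} lines, scraped: {scrapedLines} lines");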

Answer

Ok, as requested. Glad to have helped. :)

But in your C#, where are you copying the output from? In your code you have urlContent = sr.ReadToEnd(); how are you viewing or copying the result of that? Are you copying it from the debugger? If so, it may be the debugger's object inspector that is trimming the string. Have you tried taking the result from urlContent and saving it to a file, e.g. System.IO.File.WriteAllText(@"temp.txt", urlContent);?
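
A quick sketch of that check (the file path is just an example) will show whether urlContent is genuinely truncated or only being trimmed by the debugger's string visualizer:

// Write the downloaded HTML to disk so it can be inspected in full.
System.IO.File.WriteAllText(@"temp.txt", urlContent);
Console.WriteLine($"Saved {urlContent.Length} characters to temp.txt");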