Sean de Hoon - 3 months ago
R Question

How do I scrape information from website source code/html using R?

I am pretty new to webscraping and I am trying to build a scraper that accesses information in the website's source code/html using R.

Specifically, I want to determine whether each of a number of websites contains an element whose id includes a certain text: "google_ads_iframe". The id will always be longer than this, so I think I will need a wildcard (partial match).

I have tried several options (see below), but so far nothing has worked.

1st method:

doc <- htmlTreeParse("http://www.funda.nl/")

data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)


Error message reads:

Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"


2nd method:

scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)

x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)


x returns as an empty list.

3rd method:

scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)


Again, x is returned as an empty list.

I can't figure out what I am doing wrong, so any help would be really great!

Answer

My ad blocker is probably preventing me from seeing google ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins rvest and the xml2 package) do the work for you and just wrap your XPath with boolean():

library(xml2)

pg <- read_html("http://www.funda.nl/")

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
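
As a minimal sketch of the same idea applied to the id from the question: XPath's contains() already does partial matching, so no separate wildcard syntax is needed. The two inline pages below are made-up stand-ins for downloaded HTML, and matching any element with * (rather than only div) is an assumption about what the asker wants.

```r
library(xml2)

# Hypothetical inline pages standing in for downloaded HTML.
with_ad <- read_html('<html><body><div id="wrap"><iframe id="google_ads_iframe_/123/Slot_0"></iframe></div></body></html>')
without_ad <- read_html('<html><body><div id="wrap"><p>No ads here</p></div></body></html>')

# contains() matches any id that includes the fragment;
# * matches any element, not just div.
xp <- "boolean(.//*[contains(@id, 'google_ads_iframe')])"

has_ad <- xml_find_lgl(with_ad, xp)     # TRUE
no_ad  <- xml_find_lgl(without_ad, xp)  # FALSE
```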

One other issue you'll need to deal with is that the google ads iframes are most likely generated after page load with JavaScript, so they will not be in the HTML that read_html() downloads. That means using RSelenium to grab the rendered page source (you can then use this method with the resultant page source).

UPDATE

I found a page example with google_ads_iframe in it:

pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")

xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3

That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):

library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()

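# URL is a placeholder for the address of the page you want to check
remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]

# parse the rendered source so the XPath tests above can be applied to it
pg <- read_html(raw_html)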
...

# eventually (when done)
phantom_js$stop()
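
Since the question is about checking a number of websites, here is a rough sketch of wrapping the test in a helper and applying it to several page sources at once. The helper name has_google_ads_iframe and the two inline HTML strings are hypothetical; in practice each string would be the raw_html returned by getPageSource() for one site.

```r
library(xml2)

# Hypothetical helper: TRUE if the HTML source contains any element
# whose id includes "google_ads_iframe".
has_google_ads_iframe <- function(html_source) {
  pg <- read_html(html_source)
  xml_find_lgl(pg, "boolean(.//*[contains(@id, 'google_ads_iframe')])")
}

# Made-up stand-ins for rendered page sources, one per site.
page_sources <- c(
  site_a = '<html><body><div><iframe id="google_ads_iframe_/1/a_0"></iframe></div></body></html>',
  site_b = '<html><body><div id="content">plain page</div></body></html>'
)

# One logical value per site, named after the site.
results <- vapply(page_sources, has_google_ads_iframe, logical(1))
```

Using vapply() with logical(1) keeps the result a plain named logical vector and fails loudly if the helper ever returns something else.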

NOTE

The nested XPath I used with the codepen example (which does have a google ads iframe) was necessary because the matching id is on the iframe, not on the div. Here's the snippet where the iframe exists:

<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
  <script type="text/javascript">
  googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
  </script>
  <iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:&quot;<html><body style='background:transparent'></body></html>&quot;" style="border: 0px; vertical-align: bottom;"></iframe></div>

The iframe tag is a child of the div, and the id containing "google_ads_iframe" belongs to the iframe. If you want to target the div, you have to add the child as a predicate and test the attribute on that child.
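
To illustrate the difference, here is a small sketch run against a trimmed copy of the snippet above (most attributes and the script tag are dropped for brevity, which is my simplification). It compares testing the div's own id with testing the id of its iframe child.

```r
library(xml2)

# Trimmed copy of the snippet: a div whose iframe child carries the id.
snippet <- read_html('<div id="div-gpt-ad-1379506098645-3"><iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0"></iframe></div>')

# The div's own id is "div-gpt-ad-...", so this test does not match.
div_only <- xml_find_lgl(snippet, "boolean(.//div[contains(@id, 'google_ads_iframe')])")

# A div that has an iframe child with a matching id does match.
div_with_child <- xml_find_lgl(snippet, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

# Targeting the iframe directly also works; count() needs xml_find_num().
n_iframes <- xml_find_num(snippet, "count(.//iframe[contains(@id, 'google_ads_iframe')])")
```

Here div_only is FALSE, div_with_child is TRUE and n_iframes is 1.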
