Sean de Hoon - 2 months ago
R Question

How do I scrape information from website source code/html using R?

I am pretty new to web scraping and I am trying to build a scraper in R that looks for information in a website's source code/HTML.

Specifically, I want to determine whether each of a number of websites contains an element whose id includes the text "google_ads_iframe". The full id will always be longer than this, so I think I will have to use a wildcard.

I have tried several options (see below), but so far nothing has worked.

1st method:

doc <- htmlTreeParse("")

data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)

Error message reads:

Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"

2nd method:

scrapestuff <- scrape(url = "", parse = T, headers = T)

x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)

x is returned as an empty list.

3rd method:

scrapestuff <- read_html("")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)

Again, x is returned as an empty list.

I can't figure out what I am doing wrong, so any help would be really great!


My ad blocker is probably preventing me from seeing google ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins rvest and the xml2 package) do the work for you and just wrap your XPath with boolean():


library(xml2)

pg <- read_html("")

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE

One other issue you'll need to deal with is that the google ads iframes are most likely being generated after page-load with javascript, which means using RSelenium to grab the page source (you can then use this method with the resultant page source).
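To illustrate why the rendered source matters, here is a minimal sketch using two made-up HTML strings: one standing in for the source as served (only the script that would later inject the iframe) and one standing in for the source after JavaScript has run. Both strings, and the names `served` and `rendered`, are assumptions for illustration and not taken from any real page.

```r
library(xml2)

# Made-up source as served: only the script that will later inject the iframe
served <- '<html><body><div id="div-gpt-ad-1"><script>googletag.cmd.push(function() {});</script></div></body></html>'

# Made-up source after rendering: the iframe has been added to the div
rendered <- '<html><body><div id="div-gpt-ad-1"><iframe id="google_ads_iframe_0"></iframe></div></body></html>'

xpath <- "boolean(.//iframe[contains(@id, 'google_ads_iframe')])"

in_served   <- xml_find_lgl(read_html(served), xpath)    # FALSE: no iframe yet
in_rendered <- xml_find_lgl(read_html(rendered), xpath)  # TRUE: iframe present
```

If a plain `read_html()` on the URL gives `FALSE` for a page you know shows ads, this is a likely reason.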


I found a page example with google_ads_iframe in it:

pg <- read_html("")

xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
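Since the question is about checking a number of websites, the same test can be wrapped in a small function and applied over several page sources. This is only a sketch: `has_id_containing()` is a hypothetical helper, the HTML strings are made up to stand in for downloaded or rendered pages, and `.//*` is used so that the match does not depend on which tag carries the id.

```r
library(xml2)

# Hypothetical helper: TRUE if any element's id contains `pattern`.
# Assumes `pattern` contains no single quotes, since it is pasted into the XPath.
has_id_containing <- function(html, pattern = "google_ads_iframe") {
  pg <- read_html(html)
  xpath <- sprintf("boolean(.//*[contains(@id, '%s')])", pattern)
  xml_find_lgl(pg, xpath)
}

# Made-up page sources standing in for real (rendered) pages
pages <- c(
  with_ads    = '<html><body><div><iframe id="google_ads_iframe_1"></iframe></div></body></html>',
  without_ads = '<html><body><div id="content">No ads here</div></body></html>'
)

# One logical value per page, named after the entries in `pages`
results <- vapply(pages, has_id_containing, logical(1))
```

With real sites you would pass in the rendered page source for each URL instead of the literal strings.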

That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):

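library(RSelenium)
library(xml2)

# start PhantomJS; adjust pjs_cmd to wherever phantomjs is installed
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()

# load the page so that its javascript runs, then grab the rendered source
remDr$navigate("")
raw_html <- remDr$getPageSource()[[1]]

pg <- read_html(raw_html)

# eventually (when done)
remDr$close()
phantom_js$stop()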


The XPath I used with the codepen example (which has a google ads iframe) needed the nested iframe predicate. Here's the snippet where the iframe exists:

<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
  <script type="text/javascript">
  googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
  </script>
  <iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:&quot;<html><body style='background:transparent'></body></html>&quot;" style="border: 0px; vertical-align: bottom;"></iframe></div>

The iframe tag is a child of the div, and the id you are looking for is on the iframe rather than on the div. So if you target the div first, you have to add the child (the iframe) as a nested predicate in order to test one of its attributes.
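The difference between the XPath variants can be checked on a cut-down version of the snippet. This is a sketch under the assumption that only the div and iframe ids matter, so the script tag and the remaining attributes are left out. The variable names are made up for illustration.

```r
library(xml2)

# Cut-down snippet: the matching id sits on the iframe, not on the div
snippet <- '<div id="div-gpt-ad-1379506098645-3">
  <iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe>
</div>'
pg <- read_html(snippet)

# Tests the div's own id, which does not contain the text
div_own_id <- xml_find_lgl(pg, "boolean(.//div[contains(@id, 'google_ads_iframe')])")

# Tests for a div that has an iframe child whose id contains the text
div_with_child <- xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

# Tests any element, whatever its tag
any_element <- xml_find_lgl(pg, "boolean(.//*[contains(@id, 'google_ads_iframe')])")
```

Here `div_own_id` is `FALSE` while the other two are `TRUE`, which is why the XPath from the question (testing the div's own `@id`) came back empty even on a page that contains the iframe.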