I am pretty new to web scraping and I am trying to build a scraper in R that accesses information in a website's HTML source.
Specifically, I want to determine whether one or more websites contain an element whose id includes a certain text: "google_ads_iframe". The full id will always be longer than this, so I think I will have to use a wildcard.
I have tried several options (see below), but so far nothing has worked.
doc <- htmlTreeParse("http://www.funda.nl/")
data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)
Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"
scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)
x <- xpathSApply(scrapestuff[[1]], "//div[contains(@class, 'google_ads_iframe')]", xmlValue)
scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)
My ad blocker is probably preventing me from seeing google ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins rvest and the xml2 package) do the work for you and just wrap your XPath with boolean():
library(xml2)
pg <- read_html("http://www.funda.nl/")
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
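Since the question is about checking several sites for an id fragment, here is a minimal sketch of the same boolean() idea wrapped in a helper. The helper name has_id_fragment and the two inlined pages are hypothetical, made up only so the example runs without network access; for real sites you would pass URLs or fetched page source instead.

```r
library(xml2)

# Hypothetical pages, inlined as strings so the sketch runs offline.
pages <- c(
  with_ad    = '<html><body><div id="google_ads_iframe_example_0__container__"></div></body></html>',
  without_ad = '<html><body><div id="content">No ads here</div></body></html>'
)

# Hypothetical helper: TRUE if any element has an id containing `fragment`.
has_id_fragment <- function(html, fragment = "google_ads_iframe") {
  doc <- read_html(html)
  # contains() does the "wildcard" match; boolean() makes libxml2 return TRUE/FALSE.
  xpath <- sprintf("boolean(.//*[contains(@id, '%s')])", fragment)
  xml_find_lgl(doc, xpath)
}

# One logical value per page, named after the page.
result <- vapply(pages, has_id_fragment, logical(1))
print(result)
```

Using .//* rather than .//div checks every element, which seems closer to "does this page have such an id anywhere"; narrow it back to div or iframe if you only care about those tags.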
I found a page example with google_ads_iframe in it:
pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")
xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE
# xml_find_num() is the xml2 accessor for XPath expressions that return a number
xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):
library(RSelenium)
library(xml2)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()
remDr$navigate(URL)
# getPageSource() returns a list; the rendered HTML is its first element
raw_html <- remDr$getPageSource()[[1]]
pg <- read_html(raw_html)
...
# eventually (when done)
phantom_js$stop()
The XPath I used with the codepen example (since it has a google ads iframe) was necessary because of where the iframe sits in that page's markup: the iframe tag is a child of the div, so if you want to target the div first, you then have to add the child target to find an attribute on it.
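To make the parent/child point concrete, here is a small sketch on made-up markup. The HTML string, class names, and ids are hypothetical and are not the codepen page's actual source; they only mimic a div that wraps an iframe carrying the id.

```r
library(xml2)

# Hypothetical markup: each ad iframe sits inside a plain wrapper div.
html <- '<html><body>
  <div class="ad-slot"><iframe id="google_ads_iframe_example_0"></iframe></div>
  <div class="ad-slot"><iframe id="google_ads_iframe_example_1"></iframe></div>
  <div id="content">text</div>
</body></html>'
doc <- read_html(html)

# The wrapper divs carry no matching id themselves, so this is FALSE.
div_own_id <- xml_find_lgl(doc, "boolean(.//div[contains(@id, 'google_ads_iframe')])")

# Targeting divs that have an iframe child with a matching id is TRUE.
div_with_child <- xml_find_lgl(doc, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

# Count the wrapper divs that contain such an iframe.
n_divs <- xml_find_num(doc, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

print(c(div_own_id = div_own_id, div_with_child = div_with_child))
print(n_divs)
```

If you do not care about the wrapper at all, querying the iframe directly with .//iframe[contains(@id, 'google_ads_iframe')] should work just as well.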