Diego Br - 8 months ago
HTML Question

Expand html collapsible lists automatically in R

I'm trying to scrape a bunch of information from a particular website that contains a list of different species belonging to Mollusca. The main page is http://emollusks.myspecies.info/taxonomy/term/8

Once I get to a particular species (e.g. http://emollusks.myspecies.info/taxonomy/term/12257), extracting the information itself is not a problem. However, if you navigate to the main page above, you'll see that it contains a collapsible menu rooted at 'Mollusca'. Thus, to get to a particular species, I have to first expand that menu manually, save the .html page, and only then parse it in R using the XML package. I'd like to develop an R script that starts at the main page and automatically expands all possible boxes, so I can later access the information for each species at once. I have no idea where to start though.

Thanks very much for your assistance.

Here's a working solution based on the accepted answer below by @hrbrmstr:



library(httr)   # GET, add_headers, set_cookies, content
library(rvest)  # read_html, html_nodes, html_text, html_attr (re-exports %>%)
library(plyr)   # ldply

tinyTaxUrl <- function(ID) {
  sprintf('http://emollusks.myspecies.info/tinytax/get/%s', ID)
}

termTaxUrl <- function(ID) {
  sprintf('http://emollusks.myspecies.info/taxonomy/term/%s', ID)
}

# Hit the tinytax AJAX endpoint the way the browser does and pull the
# HTML fragment out of the JSON response
extractContent <- function(...) {
  content(GET(url = tinyTaxUrl(...),
              add_headers(Referer = termTaxUrl(...)),
              set_cookies(has_js = "1")))[[2]]$data
}

# Parse that fragment and return each child taxon with its term ID
readHtmlAndReturnTaxID <- function(...) {
  pg <- read_html(extractContent(...))
  taxaList <- pg %>% html_nodes('li > a')
  data.frame(taxa = taxaList %>% html_text(),
             ids = basename(taxaList %>% html_attr('href')),
             stringsAsFactors = FALSE)
}

startTaxaID <- '8'
eBivalvia <- readHtmlAndReturnTaxID(startTaxaID)
eBivalvia2 <- ldply(eBivalvia$ids, readHtmlAndReturnTaxID)

# Walk the taxonomy tree level by level until a pass returns no new taxa
n <- 1
while (nrow(eBivalvia2) > 0) {
  cat(n, '\n')
  n <- n + 1
  eBivalvia <- rbind(eBivalvia, eBivalvia2)
  eBivalvia2 <- ldply(eBivalvia2$ids, readHtmlAndReturnTaxID)
}

eBivalvia$urls <- termTaxUrl(eBivalvia$ids)
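Once eBivalvia$urls holds a term page for every taxon, the species pages themselves can be fetched and parsed in a loop. A minimal sketch continuing the script above; the scrapeSpeciesPage helper and its 'h1' selector are assumptions, so inspect a real species page and adjust the selectors to whatever fields you actually need:

```r
library(rvest)  # read_html, html_nodes, html_text
library(plyr)   # ldply

# Hypothetical per-page scraper -- the CSS selector here is an
# assumption, not taken from the site; adapt it after inspecting
# an actual species page in Developer Tools
scrapeSpeciesPage <- function(url) {
  pg <- read_html(url)
  data.frame(
    name = pg %>% html_nodes('h1') %>% html_text() %>% paste(collapse = ' '),
    url  = url,
    stringsAsFactors = FALSE
  )
}

# Be polite to the server: pause between requests
speciesInfo <- ldply(eBivalvia$urls, function(u) {
  Sys.sleep(1)
  scrapeSpeciesPage(u)
})
```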



Open up Developer Tools and watch for the XHR request when you tick the [+]. You can feed the "Copy as cURL" output straight to my curlconverter package and it'll help you turn it into an httr request. Then you'll be able to grab the other species and their URLs from the data element of the XHR response:


cURL <- "curl 'http://emollusks.myspecies.info/tinytax/get/8' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: no-cache' -H 'X-Requested-With: XMLHttpRequest' -H 'Cookie: has_js=1' -H 'Connection: keep-alive' -H 'Referer: http://emollusks.myspecies.info/taxonomy/term/8' --compressed"

req <- make_req(straighten(cURL))
pg <- read_html(httr::content(req[[1]](), as="parsed")[[2]]$data)

html_nodes(pg, "li > a")

## {xml_nodeset (10)}
##  [1] <a href="/taxonomy/term/12" class="">Conchifera</a>
##  [2] <a href="/taxonomy/term/18" class="">Placophora</a>
##  [3] <a href="/taxonomy/term/9" class="">Bivalvia</a>
##  [4] <a href="/taxonomy/term/10" class="">Caudofoveata</a>
##  [5] <a href="/taxonomy/term/11" class="">Cephalopoda</a>
##  [6] <a href="/taxonomy/term/14" class="">Gastropoda</a>
##  [7] <a href="/taxonomy/term/16" class="">Monoplacophora</a>
##  [8] <a href="/taxonomy/term/19" class="">Polyplacophora</a>
##  [9] <a href="/taxonomy/term/20" class="">Scaphopoda</a>
## [10] <a href="/taxonomy/term/21" class="">Solenogastres</a>

Here's a modified version of the httr call that curlconverter generates:


GET(url = "http://emollusks.myspecies.info/tinytax/get/8", 
    add_headers(Referer = "http://emollusks.myspecies.info/taxonomy/term/8"), 
    set_cookies(has_js = "1"))

It should be possible to (eventually) grok the URL pattern and get whatever you need (you can poke around the Drupal tinytax module docs to get an idea for how it works, too).
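As a starting point for grokking that pattern, the GET call above generalizes to any taxonomy term ID. A hedged sketch (the helper names are mine, not from the module; only the URL templates and the has_js cookie come from the captured request):

```r
library(httr)  # GET, add_headers, set_cookies, stop_for_status, content

# The tinytax endpoint appears to accept any taxonomy term ID;
# the Referer header and has_js cookie mimic what the browser sends
tinytax_url <- function(id) {
  sprintf("http://emollusks.myspecies.info/tinytax/get/%s", id)
}

term_url <- function(id) {
  sprintf("http://emollusks.myspecies.info/taxonomy/term/%s", id)
}

# Fetch the HTML fragment listing the children of one taxonomy node
get_children <- function(id) {
  res <- GET(url = tinytax_url(id),
             add_headers(Referer = term_url(id)),
             set_cookies(has_js = "1"))
  stop_for_status(res)    # fail loudly on HTTP errors
  content(res)[[2]]$data  # the HTML fragment for this node
}
```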