ChemInformant - 6 months ago
HTML Question

Scraping tables isn't always easy

I just edited my question to make it more general:
"How to scrape a table using R, when the format is not covered by any R functions?"

First of all, how should I know whether the format matches what R packages like rvest can extract?

Second, let's say I've tried all the available scraping functions and they failed; how should I proceed? Write a parsing function myself? Is there an easier way to do it?

If readHTMLTable does not work in this instance, what other options should I pursue besides parsing the HTML with a huge pile of string manipulation?

Answer

I think the general answer to this is "scraping in any language is often a pain in the neck". This is because people put stuff on the web in random, crappy formats that are difficult for machines to parse.

I don't do an enormous amount of scraping, and don't have a better answer than "poke around in the source view of the page, use trial and error".
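As a first check, before hand-rolling anything, it's worth seeing whether rvest's built-in html_table() can parse the page at all; if it can, you're done. Here's a minimal sketch on a toy document (invented for illustration, standing in for the real page):

```r
library(rvest)

## A small well-formed table, made up for illustration
doc <- minimal_html("<table><tr><th>x</th><th>y</th></tr>
                     <tr><td>1</td><td>2</td></tr></table>")

## If this comes back as a list of sensible data frames,
## no hand parsing is needed
tabs <- html_table(html_nodes(doc, "table"))
str(tabs)
```

Only when this returns garbage (or errors out) is it time to drop down to node-level extraction as below.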

It looks like the table is badly structured; if you try to extract the <tr> (table row) elements you get junk ...

Weblink <- "http://hmofs.northwestern.edu/hc/crystals.php"
library(rvest)
rr <- read_html(Weblink)
tab2 <- html_nodes(rr,"table")[4]        ## get 4th table
vals <- html_text(html_nodes(tab2,"td")) ## get *all* elements in 4th table

Now keep only the numeric values; the 7th column of the table holds download links, which get discarded this way:

vals <- suppressWarnings(na.omit(as.numeric(vals)))
matrix(vals,byrow=TRUE,ncol=6)
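The extract-everything-then-filter trick can be reproduced offline on a small made-up table, which is handy for testing when the real site is unreachable (the HTML and values below are invented for illustration):

```r
library(rvest)

## Invented malformed table: one non-numeric "download" cell per row,
## mimicking the structure of the real page
doc <- minimal_html("<table>
  <tr><td>1.5</td><td>2.5</td><td>get</td></tr>
  <tr><td>3.5</td><td>4.5</td><td>get</td></tr>
</table>")

cells <- html_text(html_nodes(doc, "td"))
## "get" coerces to NA and is dropped, leaving only the numbers
nums <- suppressWarnings(na.omit(as.numeric(cells)))
matrix(nums, byrow = TRUE, ncol = 2)
```

The same pattern scales to the real table: grab every cell, coerce, drop the NAs, and reshape with the known column count.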