ChemInformant - 9 months ago
HTML Question

Scraping tables isn't always easy

I just edited my question to make it more general:
"How can I scrape a table using R when its format is not covered by any R function?"

First of all, how can I tell whether the format matches what R functions like

can extract?

Second, suppose I have tried all the available scraping functions and they failed. How should I proceed? Write a parsing function myself? Is there an easier way to do it?

cannot work in this instance. What other options should I pursue besides parsing the HTML code through one huge string manipulation?


I think the general answer to this is "scraping in any language is often a pain in the neck". This is because people put stuff on the web in random, crappy formats that are difficult for machines to parse.

I don't do an enormous amount of scraping, and don't have a better answer than "poke around in the source view of the page, use trial and error".
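Part of that poking around can be done from within R. The following is a minimal sketch, assuming the rvest package is installed; the inline HTML page is a hypothetical stand-in (not the page from the question), and the names `page`, `tables`, `n_rows`, `n_cells`, and `candidate` are illustrative only. It counts the `<table>` elements on a page, sizes each one up, and then tries the built-in table parser on the most promising candidate.

```r
library(rvest)  # assumed installed; provides read_html(), html_nodes(), html_table()

# Hypothetical stand-in page: one layout table and one data table
page <- read_html('
<html><body>
  <table><tr><td>menu</td></tr></table>
  <table>
    <tr><th>id</th><th>mass</th></tr>
    <tr><td>1</td><td>18.02</td></tr>
    <tr><td>2</td><td>44.01</td></tr>
  </table>
</body></html>')

# All <table> elements on the page
tables <- html_nodes(page, "table")

# Count rows and data cells per table to spot the one that holds the data
n_rows  <- sapply(tables, function(tbl) length(html_nodes(tbl, "tr")))
n_cells <- sapply(tables, function(tbl) length(html_nodes(tbl, "td")))

# Try the built-in parser on the table with the most data cells
candidate <- html_table(tables[[which.max(n_cells)]])
```

If `html_table()` returns something sensible for the candidate, no manual parsing is needed; if it errors or returns misaligned columns, that is a hint the markup is irregular and the cells have to be pulled out by hand.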

It looks like the table is badly structured; if you try to extract the <tr> (table row) elements you get junk ...

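library(rvest)                             ## provides read_html(), html_nodes(), html_text()
Weblink <- ""                              ## page URL (empty in the post)
rr <- read_html(Weblink)                   ## download and parse the page
tab2 <- html_nodes(rr, "table")[4]         ## get 4th table
vals <- html_text(html_nodes(tab2, "td"))  ## get *all* elements in 4th table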

Now keep only the numeric values: as.numeric() turns every non-numeric cell into NA, and na.omit() drops those. The 7th column of the table is download information, so it gets discarded this way.

vals <- suppressWarnings(na.omit(as.numeric(vals)))
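At this point `vals` is a flat numeric vector, so the row structure still has to be restored. The sketch below shows one possible way to do that in base R. It uses a hypothetical `cells` vector in place of the scraped text, and it assumes that every row has seven cells of which exactly six are numeric; that column count is an assumption for illustration, not something confirmed for the actual page.

```r
# Hypothetical cell text, row by row: six numeric fields plus a download label
cells <- c("1", "18.02", "0", "100", "3", "7", "Download",
           "2", "44.01", "1", "200", "4", "8", "Download")

# Non-numeric cells become NA and are dropped, as in the line above
vals <- suppressWarnings(na.omit(as.numeric(cells)))

# Assumption: every row keeps exactly six numeric cells
n_numeric_cols <- 6

# Guard: the flat vector must divide evenly into rows
stopifnot(length(vals) %% n_numeric_cols == 0)

# Refill the flat vector row by row; as.numeric() strips the na.omit attributes
tab <- as.data.frame(matrix(as.numeric(vals), ncol = n_numeric_cols, byrow = TRUE))
```

The divisibility check only catches gross problems. If a single row has an empty or non-numeric cell in one of the numeric columns, later values shift left silently, so it is worth comparing a few rows of `tab` against the page by eye.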