Newbie_R - 6 months ago
HTML Question

Extracting all (possible) optional date values from web page [R]

In this URL string, the "toDate=1399849199999" part is a UNIX time expressed in milliseconds, which is used to select the Premier League table for a particular day.

In this case, the UNIX time corresponds to 11 May 2014:

as.POSIXlt(1399849199999 / 1000, tz = "GMT", origin = "1970-01-01")
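
The same conversion is vectorised, so several values can be converted in one call. A minimal sketch, assuming the values are held as character strings (as they appear in the page source); the variable names ms, times and dates are only illustrative:

```r
# Millisecond UNIX timestamps, copied from three of the <option> values in the page source.
ms <- c("1399157999999", "1399244399999", "1399849199999")

# Convert text to numeric, scale milliseconds to seconds, then build date-times in GMT.
times <- as.POSIXct(as.numeric(ms) / 1000, origin = "1970-01-01", tz = "GMT")

# Calendar date (in GMT) of each timestamp.
dates <- as.Date(times, tz = "GMT")
print(format(dates, "%Y-%m-%d"))
# [1] "2014-05-03" "2014-05-04" "2014-05-11"
```

The dates agree with the option labels SAT 03, SUN 04 and SUN 11.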


I would like to retrieve all possible UNIX time values for a particular month. For the URL provided here, there are 6 such values, stored in the web page's source code like this:

<select name="toDate" id="date" class="selectToSlider" widget="selectToSlider" labels="18" tooltip="false" wrapperClass="selectToSliderWrapper selectToSliderMatchDate"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399330799999">MON 05</option><option value="1399417199999" selected="selected">TUE 06</option><option value="1399503599999">WED 07</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>


Previously I extracted such information with regular expressions, but that was a pain in the neck, and I would like to do it in an easier way.

I would appreciate it if someone could provide code (ideally with the steps explained) that extracts those values using one of the web scraping packages in R, preferably XML. I tried it myself but was unsuccessful.

Answer

rvest makes this pretty easy. Look for the "option" nodes, then grab the "value" attributes.

library("rvest")
h <- read_html('<select name="toDate" id="date" class="selectToSlider" widget="selectToSlider" labels="18" tooltip="false" wrapperClass="selectToSliderWrapper selectToSliderMatchDate"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399330799999">MON 05</option><option value="1399417199999" selected="selected">TUE 06</option><option value="1399503599999">WED 07</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>')
h %>% html_nodes("option") %>% html_attr("value")
[1] "1399157999999" "1399244399999" "1399330799999"
[4] "1399417199999" "1399503599999" "1399849199999"
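
Since the question mentions a preference for the XML package, here is a rough equivalent using it. This is only a sketch under stated assumptions: the HTML string is a shortened copy of the fragment quoted in the question (three options instead of six), and the variable names html, doc and values are illustrative. The XPath expression restricts the match to options inside the select with id "date", in case a page contains other option elements.

```r
library(XML)

# Shortened copy of the <select> fragment from the question (assumption: three options only).
html <- '<select name="toDate" id="date"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>'

# Parse the string as HTML; asText = TRUE marks the input as markup rather than a file name.
doc <- htmlParse(html, asText = TRUE)

# Find every <option> under the select with id "date" and read its "value" attribute.
values <- xpathSApply(doc, "//select[@id='date']//option", xmlGetAttr, "value")

# Name each value with the visible label of its option (e.g. "SAT 03").
names(values) <- xpathSApply(doc, "//select[@id='date']//option", xmlValue)
print(values)
```

The result is a named character vector, so the values would still need as.numeric() before being divided by 1000 and converted to dates.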