Gracos Gracos - 7 months ago 14
HTML Question

Web Scrape from a table by Date and character string into R

I need to scrape, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date whose entry under the Release column contains "Employment Situation". The scraping output should be the following:

Friday, January 09, 2015
Friday, February 06, 2015
Friday, March 06, 2015
Friday, April 03, 2015
Friday, May 08, 2015
Friday, June 05, 2015
Thursday, July 02, 2015
Friday, August 07, 2015
Friday, September 04, 2015
Friday, October 02, 2015
Friday, November 06, 2015
Friday, December 04, 2015


To achieve that, I thought of repeating something like the following 12 times, once for each month. Note that http://www.bls.gov/schedule/news_release/2015_sched.htm contains 12 tables, one for each month, named tbl2[[2]], tbl3[[3]], and so on.

library(rvest)
url <- 'http://www.bls.gov/schedule/news_release/2015_sched.htm'
ses <- html_session(url)
tbl <- html_table(ses, fill = TRUE)
nfpdates <- tbl[[2]]$`Date`
nfpdates <- gsub('\\.', '', nfpdates)
nfpdates <- as.Date(nfpdates, 'weekdaystr(iD,:), %b %d, %Y')


It is not working. The first issue is simple: I do not know how to refer to the day of the week; 'weekdaystr(iD,:) is wrong. The second is more complicated: how do I extract only the dates whose "Release" entry contains "Employment Situation"?

Any help would be greatly appreciated. Thank you.

Answer

This is a perfect use-case for XPath:

library(rvest)

pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")

# we need to target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")

# we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

# clean up the cruft and make our dates!
as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")

##  [1] "2015-01-09" "2015-02-06" "2015-03-06" "2015-03-18" "2015-04-03"
##  [6] "2015-05-08" "2015-06-05" "2015-07-02" "2015-08-07" "2015-09-04"
## [11] "2015-10-02" "2015-11-06" "2015-12-04"
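
Note that the output above has 13 dates, while the question lists 12. As a rough check (a sketch only, using base R and nothing but the values already shown in this thread), the expected strings from the question can be parsed with the same format and compared against the scraped dates. The names expected, scraped, expected_dates and extra are made up for this illustration, and %A / %B assume an English locale.

```r
# Expected release dates, copied from the question
expected <- c(
  "Friday, January 09, 2015", "Friday, February 06, 2015",
  "Friday, March 06, 2015", "Friday, April 03, 2015",
  "Friday, May 08, 2015", "Friday, June 05, 2015",
  "Thursday, July 02, 2015", "Friday, August 07, 2015",
  "Friday, September 04, 2015", "Friday, October 02, 2015",
  "Friday, November 06, 2015", "Friday, December 04, 2015"
)

# Dates returned by the XPath query, copied from the output above
scraped <- as.Date(c(
  "2015-01-09", "2015-02-06", "2015-03-06", "2015-03-18", "2015-04-03",
  "2015-05-08", "2015-06-05", "2015-07-02", "2015-08-07", "2015-09-04",
  "2015-10-02", "2015-11-06", "2015-12-04"
))

# %A = full weekday name, %B = full month name (both locale-dependent)
expected_dates <- as.Date(expected, format = "%A, %B %d, %Y")

# Scraped dates that are absent from the expected list
extra <- scraped[!scraped %in% expected_dates]
print(extra)
```

This also covers the first part of the question: the weekday is matched with %A, so no custom weekday string is needed. If the extra date turns out to belong to a different release whose text merely contains the same phrase, a stricter XPath predicate might exclude it, but that depends on the page's markup, which is not shown here.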