Gracos Gracos - 7 months ago 41
HTML Question

Web Scrape Non Farm Payrolls dates in R

I would like to scrape the release dates for Non-Farm Payrolls (NFP) from two BLS pages: the archive at http://www.bls.gov/bls/archived_sched.htm (past years) and the schedule at http://www.bls.gov/schedule/news_release/empsit.htm (current year).

Something similar was achieved by Peter Chan for FOMC dates here: https://github.com/returnandrisk/r-code/blob/master/FOMC%20Dates%20-%20Scraping%20Data%20From%20Web%20Pages.R. This is his code:

install.packages(c("httr", "XML"), repos = "http://cran.us.r-project.org")

library(httr)
library(XML)

# get and parse web page content
webpage <- content(GET("http://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"), as="text")
xhtmldoc <- htmlParse(webpage)
# get statement urls and sort them
statements <- xpathSApply(xhtmldoc, "//td[@class='statement2']/a", xmlGetAttr, "href")
statements <- sort(statements)
# get dates from statement urls
fomcdates <- sapply(statements, function(x) substr(x, 28, 35))
fomcdates <- as.Date(fomcdates, format="%Y%m%d")
# save results in working directory
save(list = c("statements", "fomcdates"), file = "fomcdates.RData")


I would like to replicate that for NFP. Just as fomcdates contains all FOMC dates, I would like to create NFPdates containing all NFP dates.

Would anyone know how to do this for the current year only? (I am asking about the current year because it seems to be the simplest case.) Thank you.

Answer

This works for the current year.

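library(rvest)

# BLS release schedule for the Employment Situation (NFP) report
url <- 'http://www.bls.gov/schedule/news_release/empsit.htm'
# html_session() was renamed session() in rvest 1.0.0
ses <- html_session(url)
# parse every table on the page; fill = TRUE pads rows that have missing cells
tbl <- html_table(ses, fill = TRUE)
# the second table holds the schedule; keep its release-date column
nfpdates <- tbl[[2]]$`Release Date`
# strip the periods after abbreviated month names so that %b can match them
nfpdates <- gsub('\\.', '', nfpdates)
nfpdates <- as.Date(nfpdates, '%b %d, %Y')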
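
The date-cleaning step at the end can be tried without fetching the page. The following is only a minimal sketch: it assumes the `Release Date` column holds strings in a "Mon. DD, YYYY" layout, and the sample values in `raw` are illustrative, not scraped from BLS. Because `%b` is matched against the month names of the current locale, the sketch temporarily switches `LC_TIME` to "C" so that English abbreviations parse on any machine.

```r
# Illustrative sample strings in the assumed "Mon. DD, YYYY" layout
# (hypothetical input, not taken from the BLS page).
raw <- c("Jan. 08, 2016", "May 06, 2016", "Dec. 02, 2016")

# %b depends on the LC_TIME locale, so use the "C" locale (English
# month names) for the duration of the parse and restore it afterwards.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# Remove the periods after abbreviated month names, then parse.
clean <- gsub(".", "", raw, fixed = TRUE)
parsed <- as.Date(clean, format = "%b %d, %Y")

Sys.setlocale("LC_TIME", old_locale)

# Strings that do not match the format come back as NA, so fail loudly.
stopifnot(!anyNA(parsed))
print(parsed)
```

If the page layout changes and a string no longer matches the format, `as.Date` returns `NA` for it instead of raising an error, which is why the sketch checks for `NA` explicitly.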