Mike Edinger Mike Edinger - 1 year ago 68
R Question

Reading table from https webpage using readHTMLTable

I have R 3.3.1 installed and am using RStudio 0.99.903. I'm trying to read a table into R from the following URL: https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php

(I'm well aware that there is a download button, however, that is not an option for me right now)

I was able to do this easily last year with the readHTMLTable function. However, in that time, that site changed from using http to https, which causes an "XML content does not eem to be XML" error.

I tried what was suggested here: get url table into a `data.frame` R-XML-RCurl

url <- getURL("https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php")
df <- readHTMLTable(URL, header = T)

The get URL function returns a large string that is essentially meaningless to me, which means readHTMLTable doesn't work correctly (I get a list, with a couple of data frames, but those are also meaningless to me. It's in Spanish with observations of things that I have no idea where they came from):

[1] "\r\n<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n <title>2016 QB Fantasy Football Rankings, QB Cheat Sheets, QB Draft / Draft Rankings</title>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n <meta name=\"description\" content=\"Don&#8217;t trust any 1 fantasy football expert? We combine their rankings into 1 Expert Consensus Ranking. Our 2016 Draft QB rankings are updated daily.\">\n<link rel=\"canonical\" href=\"https://www.fantasypros.com/nfl/rankings/qb-cheatsheets.php\" />\n\n <meta property=\"fb:pages\" content=\"184352014941166\"/>\n

It goes on for quite a ways more.

Can someone give me suggestions on how to get this to work?


Answer Source

Get html file from the URL

URL <- "https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php"
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))

Parse HTML file

doc <- htmlParse(temp)

XPath query is constructed by selecting the table element with class = "player-table" and its child tr element having class = 'mpb-player-'

xpexpr <- "//table[contains(@class, 'player-table')]/tbody/tr[contains(@class, 'mpb-player-')]"

Get the list of nodes from doc for the xpath expression

listofTableNodes <- getNodeSet(doc, xpexpr)

create a dataframe with the xmlvalues of the node list

df <- xmlToDataFrame(listofTableNodes, stringsAsFactors = FALSE)
# alternatively xpathSApply can be used to get the same data frame
# df <- xmlToDataFrame(xpathSApply(doc, xpexpr), stringsAsFactors = FALSE)

remove empty columns

df <- df[, seq(1, length(df), by = 2)]

add column names

xpexpr <- "//table[contains(@class, 'player-table')]/thead/tr/th"
listofTableNodes <- getNodeSet(doc, xpexpr)
colnames(df) <- gsub("[\r\n ]*$", '', xmlSApply(listofTableNodes, xmlValue))

#   Rank          Player (Team) Pos Bye Best Worst Avg Std Dev ADP vs. ADP
# 1    1     Antonio Brown PIT  WR1   8    1     5 1.3     0.8 1.0     0.0
# 2    2 Odell Beckham Jr. NYG  WR2   8    1     9 3.1     1.6 2.0     0.0
# 3    3       Julio Jones ATL  WR3  11    1     6 3.4     1.1 4.0    +1.0
# 4    4        Todd Gurley LA  RB1   8    1    11 4.5     2.3 3.0    -1.0
# 5    5     David Johnson ARI  RB2   9    1    19 6.1     3.5 6.0    +1.0
# 6    6   Adrian Peterson MIN  RB3   6    1    22 7.6     3.8 5.0    -1.0