BlueDevilPride - 2 months ago
R Question

Import SharePoint site data into R

I came across a public data set that I can't figure out how to get directly into R. Typically, I use the following R code to pull data off the web:

temp <- tempfile()                                # throwaway download target
download.file("http://www.webaddress.com", temp)  # fetch the zip archive
data <- read.csv(unz(temp, "name_of_file"))       # read one file out of the zip
unlink(temp)                                      # clean up


This SEC site, however, has me a bit confused about how to get the data directly into R. One reason is that when you right-click the link, instead of a web address you get the following code:

javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$cphMain$lnkSECReport", "", false, "", "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", false, true))


Here is the web address: http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx

Is there a way to get this data directly into R? As of now, I download the file, open it with 7-Zip, save it into Excel, and then import it into R.

Updated Code

library(httr)
library(xml2)
library(magrittr) # for %>%

res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
            httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
            body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"),
            encode = "form")

# write the gzip'd response to disk, then read the XML back out of it
writeBin(content(res, as = "raw"), "report.gz")
gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)

xml_find_all(doc, ".//Firms/Firm/Info") %>%
  xml_attr("LegalNm") %>%
  head(10)

Answer

It's one of those truly, horribly wretched SharePoint sites that are popping up like crazy across almost every government e-initiative globally, making data more opaque rather than less.

Having said that, I'm completely amazed this worked:

library(httr)
library(xml2)
library(magrittr) # needed for the %>% pipeline later

res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
           httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"), 
           body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"), 
           encode = "form")

I used curlconverter to extract the web call after cancelling the direct download and inspecting said call in Developer Tools (which had to be open before the download started).
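For the curious, the basic curlconverter flow looks something like this (roughly sketched, and assuming you've done "Copy as cURL" on the request in Developer Tools first):

library(curlconverter)

# straighten() parses the copied cURL command (clipboard by default);
# make_req() turns each parsed call into a working httr request function
req <- make_req(straighten())
res <- req[[1]]()

From there you can whittle the generated call down by hand, which is what happened below.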

The "raw" computed httr request function looked like this:

httr::VERB(verb = "POST", url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
           httr::add_headers(Origin = "http://www.adviserinfo.sec.gov", 
                             `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
                             `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36", 
                             Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
                             `Cache-Control` = "max-age=0", Referer = "http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx", 
                             Connection = "keep-alive", DNT = "1"), httr::set_cookies(ASP.NET_SessionId = "vp5bt2nrl5m3l4tqq4mkbfrz"), 
           body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport", 
                       `__EVENTARGUMENT` = "", `__VIEWSTATE` = "/wEPDwUIOTg2OTY2NjYPZBYCZg9kFgQCAQ8WAh4EVGV4dAUeSUFQRCAtIEludmVzdG1lbnQgQWR2aXNlciBEYXRhZAIDD2QWAgIFD2QWEAIDDw8WAh4LUG9zdEJhY2tVcmwFUn4vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMWAh4Hb25jbGljawWvAWdhKCdzZW5kJywgJ3BhZ2V2aWV3JywgeydwYWdlJzogJ34vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMnLCAndGl0bGUnOiAnSUFQRCAtIFNFQyBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J0IChHWklQKSd9KTtkAgcPZBYCZg8PFgIfAAVKUmVwb3J0IGFzIG9mOiA8Yj5TZXB0ZW1iZXIgNiwgMjAxNjwvYj4gPGJyLz5BcHByb3hpbWF0ZSBmaWxlIHNpemU6IDM3IE1CICBkZAINDw8WAh8BBVR+L0lBUEQvQ29udGVudC9CdWxrRmVlZC9Db21waWxhdGlvbkRvd25sb2FkLmFzcHg/RmVlZFBLPTM3MjY1JkZlZWRUeXBlPUlBX0ZJUk1fU1RBVEUWAh8CBbMBZ2EoJ3NlbmQnLCAncGFnZXZpZXcnLCB7J3BhZ2UnOiAnfi9JQVBEL0NvbnRlbnQvQnVsa0ZlZWQvQ29tcGlsYXRpb25Eb3dubG9hZC5hc3B4P0ZlZWRQSz0zNzI2NSZGZWVkVHlwZT1JQV9GSVJNX1NUQVRFJywgJ3RpdGxlJzogJ0lBUEQgLSBTdGF0ZSBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J... <truncated>
                       `__VIEWSTATEGENERATOR` = "C7F140E8", `__PREVIOUSPAGE` = "_n_AIWFFdFo0uFQroVexEbLyjk41mQczgUv0yM_5WfsMAs5Mr4_W9OsfhauW1md49E6AtLMLKvwsM3efjdsFxSQVs8m60rXjM2G3a38s-vs9jeifY7Z97KwNciQDnS3E0", 
                       `__EVENTVALIDATION` = "/wEdAAQgBK7oCoSH1SyM/nnv4+7OQ6BBh5UglL0V4PbvTmfHL5ETgQBTBoVSpnQmZd0nxKz/1ubqHHzGDP0ztOLUKJjXWi90IlgKV4uaEBSHcRvGBiO1/K20oSh88Xa2qq9BBCI="), 
           encode = "form")

It's been my experience that these truly evil SharePoint sites need the various "view state" fields, but I took a shot at whittling down and transforming the call, and it's working (at least it was in the two minutes after I first went to the site).

You're not out of the woods yet since:

res$headers$`content-type`
## "application/x-gzip; charset=utf-8"

even if you add:

`Accept-Encoding` = "gzip, deflate"

to the add_headers() call.

So, since memDecompress() is no help here (it chokes on this gzip'd payload), you need to:

writeBin(content(res, as="raw"), "report.gz")

to get the gzip'd content into a file.
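If you'd rather not litter the working directory, a throwaway file works just as well (same idea, just via tempfile()):

tmp <- tempfile(fileext = ".gz")         # throwaway path for the gzip'd payload
writeBin(content(res, as = "raw"), tmp)  # then hand tmp to gzfile() below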

Now, we can work with it directly:

gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)

xml_find_all(doc, ".//Firms/Firm/Info") %>% 
  xml_attr("LegalNm") %>% 
  head(10)
## [1] "LAUNCH ANGELS MANAGEMENT COMPANY, LLC"       "JACOBSEN CAPITAL MANAGEMENT, LLC"           
## [3] "CORESTATES CAPITAL ADVISORS, LLC"            "MINNEAPOLIS PORTFOLIO MANAGEMENT GROUP, LLC"
## [5] "SHANNON RIVER FUND MANAGEMENT, LLC"          "AAC BENELUX HOLDING BV"                     
## [7] "WILLINK ASSET MANAGEMENT LLC"                "SPIVAK ASSET MANAGEMENT, LLC"               
## [9] "ANNALY MANAGEMENT COMPANY LLC"               "WOODMONT INVESTMENT COUNSEL, LLC"           
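If you need more than the legal names, one approach (a sketch, untested against the full feed) is to pull every attribute from each Info node and bind the results into a data frame; xml_attrs() hands back a named vector per node, so you don't have to guess at attribute names:

library(purrr) # map_df() also needs dplyr installed

info  <- xml_find_all(doc, ".//Firms/Firm/Info")
firms <- map_df(info, ~as.list(xml_attrs(.x))) # one row per firm, one column per attribute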

I haven't tried it, but I suspect you can take:

javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(
  -----> "ctl00$cphMain$lnkStateReport", 
  "", 
  false, 
  "", 
  -----> "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE", 
  false, 
  true))

the items identified by -----> and put them in the obvious places in the url and body arguments to get the other content. Those parameters are from the "State Investment Adviser Report" button's link source.
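If that pans out, a small helper makes grabbing either feed a one-liner (untested; the FeedPK / FeedType / __EVENTTARGET values are lifted straight from the two postback link sources):

fetch_feed <- function(feed_pk, feed_type, event_target) {
  POST(
    url = sprintf("http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=%d&FeedType=%s",
                  feed_pk, feed_type),
    httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
    body = list(`__EVENTTARGET` = event_target),
    encode = "form"
  )
}

sec_res   <- fetch_feed(37264, "IA_FIRM_SEC",   "ctl00$cphMain$lnkSECReport")
state_res <- fetch_feed(37265, "IA_FIRM_STATE", "ctl00$cphMain$lnkStateReport")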

If you really don't want to write content to a file, you can try a non-exposed function in an alpha package of mine to inflate the gzip'd raw content directly in R:

devtools::install_git("https://gitlab.com/hrbrmstr/warc.git")

raw_report <- warc:::gzuncompress(content(res, as="raw"), 50*1024*1024)
doc <- read_xml(raw_report)
...
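Base R can also inflate the gzip'd raw content in memory without a temp file (untested here, but gzcon() over a raw connection is the usual trick), which sidesteps the alpha package entirely:

con <- gzcon(rawConnection(content(res, as = "raw")))  # reads come back decompressed
doc <- read_xml(con)
close(con)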