Rime Rime - 26 days ago 7
R Question

how do to scrape web table from website using R

I am trying to scrape the table found from the following site:

https://finance.yahoo.com/gainers?e=us


However, I have searched for answers on several different methods to scrape tables from site on here and none of the methods have worked for me.

I have tried:

library(xml2)
url <- "https://finance.yahoo.com/gainers?e=us"
tbl <- read_html(url)


also:

library(XML)
url <- "https://finance.yahoo.com/gainers?e=us"
tbl <- readHTMLList(url)


and other packages such as
rvest
however I cannot get the table to show!

Answer

They store that data in a javascript block in the page itself. You have two choices: either use RSelenium and then deal with a really complex table that you have to unwind or perform some string surgery combined with some JSON mangling after getting some help from the V8 package:

library(V8)
library(xml2)
library(stringi)
library(jsonlite)
library(ndjson)

pg <- read_html("https://finance.yahoo.com/gainers?e=us")

This next bit extracts the data from the salient <script> tag. I'm doing this by position, which means this is the first potential thing to break if Yahoo! ever changes it format (to, say, insert useless Verizon Wireless ads via javascript if VZ ever does finish buying them). You can change this to look for text indicators, but that leads to the second issue…

The second issue with "scraping" this way is that if the javascript for the data ever changes this will also break. However that's also true for HTML-based scraping. This method doesn't require starting up a web server and a control server for navigating the web server. Anyway…

html_nodes(pg, "script")[13] %>% 
  html_text() %>% 
  stri_replace_first_fixed("(function (root) {", "var root = { App : {}};\n") %>% 
  stri_replace_last_fixed("}(this));", "") -> js

We had to do ^^ since the javascript code in the <script> tag was expecting to be in a browser (which we won't be). Now, we evaluate that large javascript chunk via V8 and extract the main data element:

ctx <- v8()
ctx$eval(JS(js))
root <- ctx$get("root", flatten=TRUE)

That data element holds all of the data for that page (which is really a single page app). So we have to find the data we care about which is really far down a nested javascript hole:

quotes <- root$App$main$context$dispatcher$stores$`QuoteDataStore-Immutable`$quoteData

There are numerous ways to turn that nested list data into a nice, rectangular data frame. The method below chooses the "turn it into JSON then bring it back from JSON in a completely "flat" format. Feel free to use other methods you can find on SO.

The code starts by telling R to ignore the non-stocks (since you, presumably, want just the quotes in the pretty table on that page):

discard(names(quotes), ~grepl("[\\^\\=]", .)) %>% 
  map_df(~ndjson::flatten(toJSON(quotes[[.]]))) %>% 
  glimpse()

## Observations: 30
## Variables: 53
## $ averageDailyVolume3Month.fmt.0      <chr> "437,939", "541,801", "1.033M", "992,278", "1.40...
## $ averageDailyVolume3Month.longFmt.0  <chr> "437,939", "541,801", "1,033,453", "992,278", "1...
## $ averageDailyVolume3Month.raw.0      <dbl> 437939, 541801, 1033453, 992278, 1402175, 537906...
## $ exchange.0                          <chr> "NYQ", "NMS", "NYQ", "NGM", "NGM", "NMS", "NGM",...
## $ exchangeTimezoneName.0              <chr> "America/New_York", "America/New_York", "America...
## $ exchangeTimezoneShortName.0         <chr> "EST", "EST", "EST", "EST", "EST", "EST", "EST",...
## $ fiftyTwoWeekHigh.fmt.0              <chr> "15.36", "46.12", "15.82", "7.64", "5.97", "9.86...
## $ fiftyTwoWeekHigh.raw.0              <dbl> 15.360, 46.120, 15.820, 7.640, 5.970, 9.860, 5.0...
## $ fiftyTwoWeekHighChange.fmt.0        <chr> "-6.96", "-17.72", "-4.19", "-2.83", "-4.29", "-...
## $ fiftyTwoWeekHighChange.raw.0        <dbl> -6.960, -17.720, -4.190, -2.830, -4.290, -3.640,...
## $ fiftyTwoWeekHighChangePercent.fmt.0 <chr> "-45.31%", "-38.42%", "-26.49%", "-37.04%", "-71...
## $ fiftyTwoWeekHighChangePercent.raw.0 <dbl> -0.4531, -0.3842, -0.2649, -0.3704, -0.7186, -0....
## $ fiftyTwoWeekLow.fmt.0               <chr> "6.59", "23.75", "9.46", "4.22", "0.25", "4.97",...
## $ fiftyTwoWeekLow.raw.0               <dbl> 6.590, 23.750, 9.460, 4.220, 0.251, 4.970, 0.970...
## $ fiftyTwoWeekLowChange.fmt.0         <chr> "1.81", "4.65", "2.17", "0.59", "1.43", "1.25", ...
## $ fiftyTwoWeekLowChange.raw.0         <dbl> 1.810, 4.650, 2.170, 0.590, 1.429, 1.250, 0.190,...
## $ fiftyTwoWeekLowChangePercent.fmt.0  <chr> "27.47%", "19.58%", "22.94%", "13.98%", "569.32%...
## $ fiftyTwoWeekLowChangePercent.raw.0  <dbl> 0.2747, 0.1958, 0.2294, 0.1398, 5.6932, 0.2515, ...
## $ fullExchangeName.0                  <chr> "NYSE", "NasdaqGS", "NYSE", "NasdaqGM", "NasdaqG...
## $ gmtOffSetMilliseconds.0             <dbl> -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07...
## $ invalid.0                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isLoading.0                         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ language.0                          <chr> "en-US", "en-US", "en-US", "en-US", "en-US", "en...
## $ longName.0                          <chr> "Bankrate, Inc.", "Air Methods Corp.", "Bristow ...
## $ market.0                            <chr> "us_market", "us_market", "us_market", "us_marke...
## $ marketCap.fmt.0                     <chr> "758.268M", "1.082B", "407.786M", "494.03M", "33...
## $ marketCap.longFmt.0                 <chr> "758,267,968", "1,081,625,344", "407,786,176", "...
## $ marketCap.raw.0                     <dbl> 758267968, 1081625344, 407786176, 494030272, 338...
## $ marketState.0                       <chr> "CLOSED", "CLOSED", "CLOSED", "CLOSED", "CLOSED"...
## $ messageBoardId.0                    <chr> "finmb_30061", "finmb_24494", "finmb_292980", "f...
## $ quoteType.0                         <chr> "EQUITY", "EQUITY", "EQUITY", "EQUITY", "EQUITY"...
## $ regularMarketChange.fmt.0           <chr> "1.20", "3.65", "1.44", "0.59", "0.27", "0.80", ...
## $ regularMarketChange.raw.0           <dbl> 1.20, 3.65, 1.44, 0.59, 0.27, 0.80, 0.15, 0.04, ...
## $ regularMarketChangePercent.fmt.0    <chr> "16.67%", "14.75%", "14.13%", "13.98%", "19.15%"...
## $ regularMarketChangePercent.raw.0    <dbl> 16.6667, 14.7475, 14.1315, 13.9810, 19.1489, 14....
## $ regularMarketDayHigh.fmt.0          <chr> "9.90", "30.25", "12.60", "5.29", "1.85", "6.30"...
## $ regularMarketDayHigh.raw.0          <dbl> 9.9000, 30.2500, 12.6000, 5.2900, 1.8500, 6.2950...
## $ regularMarketDayLow.fmt.0           <chr> "7.90", "25.30", "9.46", "4.30", "1.33", "5.79",...
## $ regularMarketDayLow.raw.0           <dbl> 7.9000, 25.3000, 9.4600, 4.3000, 1.3300, 5.7900,...
## $ regularMarketPrice.fmt.0            <chr> "8.40", "28.40", "11.63", "4.81", "1.68", "6.22"...
## $ regularMarketPrice.raw.0            <dbl> 8.40, 28.40, 11.63, 4.81, 1.68, 6.22, 1.16, 1.20...
## $ regularMarketTime.fmt.0             <chr> "4:02PM EDT", "4:00PM EDT", "4:00PM EDT", "4:00P...
## $ regularMarketTime.raw.0             <dbl> 1478289722, 1478289600, 1478289614, 1478289600, ...
## $ regularMarketVolume.fmt.0           <chr> "2.383M", "3.32M", "3.411M", "4.567M", "4.232M",...
## $ regularMarketVolume.longFmt.0       <chr> "2,382,856", "3,320,072", "3,411,052", "4,566,79...
## $ regularMarketVolume.raw.0           <dbl> 2382856, 3320072, 3411052, 4566790, 4232135, 185...
## $ sharesOutstanding.fmt.0             <chr> "90.27M", "38.085M", "35.063M", "102.709M", "20....
## $ sharesOutstanding.longFmt.0         <chr> "90,270,000", "38,085,400", "35,063,300", "102,7...
## $ sharesOutstanding.raw.0             <dbl> 90270000, 38085400, 35063300, 102709000, 2016800...
## $ shortName.0                         <chr> "Bankrate, Inc. Common Stock", "Air Methods Corp...
## $ sourceInterval.0                    <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, ...
## $ symbol.0                            <chr> "RATE", "AIRM", "BRS", "CERS", "EBIO", "ELNK", "...
## $ uuid.0                              <chr> "79521cde-a3ef-383f-917d-31c49f9082f5", "f0432c1...

I think you still have some data conversion to do (depending on what you're after), but you have the data that you're looking for.

Comments