shum shum - 3 years ago 169
R Question

scraping HTML data.table using rvest

I'm trying to scrape the "Fish Sampled" table data from the Minnesota DNR site using the R rvest package. I used the Chrome extension SelectorGadget to find the XPath for the table, but I'm unable to get any table data from the web page into R. Any help is appreciated.

library(rvest)

urllakes <- read_html("http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700")

lakesnodes <- html_nodes(urllakes,xpath = '//*[(@id = "lake-survey")]')

html_table(lakesnodes,fill=TRUE) #Error: html_name(x) == "table" is not TRUE
html_text(lakesnodes) # [1] "" but no data is returned

Answer Source

Start a new tab. Open Developer Tools. Then, go to http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700.

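Go to the Network tab. The page fills in the table with JavaScript after it loads, which is why rvest finds only an empty node. Look for the request to detail.cgi on maps2.dnr.state.mn.us with type=lake_survey in its query string.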

That's your target. With the following, you can pass in a MN DNR URL or just the id at the end of the URL and get data back.

library(httr)
library(jsonlite)

read_lake_survey <- function(orig_url_or_id) {

  orig_url_or_id <- orig_url_or_id[1]

  if (grepl("^htt", orig_url_or_id)) {
    tmp <- httr::parse_url(orig_url_or_id)
    if (!is.null(tmp$query$downum)) {
      orig_url_or_id <- tmp$query$downum
    } else {
      stop("Invalid URL specified", call.=FALSE)
    }
  }

  httr::GET(
    url = "http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
    query = list(
      type = "lake_survey",
      callback = "",
      id = orig_url_or_id,
      `_` = as.numeric(Sys.time())
    )
  ) -> res

  httr::stop_for_status(res)

  out <- httr::content(res, as="text", encoding="UTF-8")
  out <- jsonlite::fromJSON(out, flatten=TRUE)
  out

}

Like so:

orig_url <- "http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700"

str(read_lake_survey(orig_url), 2)
## List of 4
##  $ timestamp: int 1506900750
##  $ status   : chr "SUCCESS"
##  $ result   :List of 13
##   ..$ averageWaterClarity: chr "7.0"
##   ..$ sampledPlants      : list()
##   ..$ officeCode         : chr "F314"
##   ..$ littoralAcres      : int 76
##   ..$ shoreLengthMiles   : num 2.45
##   ..$ areaAcres          : num 152
##   ..$ surveys            :'data.frame':  6 obs. of  52 variables:
##   ..$ accesses           :'data.frame':  1 obs. of  5 variables:
##   ..$ lakeName           : chr "Weaver"
##   ..$ DOWNumber          : chr "27011700"
##   ..$ waterClarity       : chr [1, 1:2] "07/14/2008" "7"
##   ..$ meanDepthFeet      : num 20.7
##   ..$ maxDepthFeet       : int 57
##  $ message  : chr "Normal execution."

str(read_lake_survey("27011700"), 2)
## List of 4
##  $ timestamp: int 1506900750
##  $ status   : chr "SUCCESS"
##  $ result   :List of 13
##   ..$ averageWaterClarity: chr "7.0"
##   ..$ sampledPlants      : list()
##   ..$ officeCode         : chr "F314"
##   ..$ littoralAcres      : int 76
##   ..$ shoreLengthMiles   : num 2.45
##   ..$ areaAcres          : num 152
##   ..$ surveys            :'data.frame':  6 obs. of  52 variables:
##   ..$ accesses           :'data.frame':  1 obs. of  5 variables:
##   ..$ lakeName           : chr "Weaver"
##   ..$ DOWNumber          : chr "27011700"
##   ..$ waterClarity       : chr [1, 1:2] "07/14/2008" "7"
##   ..$ meanDepthFeet      : num 20.7
##   ..$ maxDepthFeet       : int 57
##  $ message  : chr "Normal execution."

str(read_lake_survey("http://example.com"))
##  Error: Invalid URL specified 
##    3. stop("Invalid URL specified", call. = FALSE) 
##    2. read_lake_survey("http://example.com") 
##    1. str(read_lake_survey("http://example.com")) 
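
The URL handling inside the function can be checked on its own, without hitting the server. This is a minimal sketch of just that step; the variable name parts is only for illustration.

library(httr)

# parse_url() splits a URL into its components; query parameters end up in $query
parts <- httr::parse_url("http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700")
parts$query$downum

# a URL with no downum parameter gives NULL here, which is what triggers stop()
is.null(httr::parse_url("http://example.com")$query$downum)

Only the id is sent on to detail.cgi, which is why passing "27011700" directly gives the same result as passing the full URL.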

You can poke at it to prove it's all there.

library(tidyverse)

# get the data into a variable
dat <- read_lake_survey(orig_url)

# focus on the surveys
surveys <- dat$result$surveys

There are "n" data frames for the surveys that match the popup on the page.

There are also many other list elements with "n" entries that are associated with the surveys in the same popup. I don't do this type of analysis, so I don't know which of them make sense to combine with the data frames.

This is likely enough to get you going a bit further. It's just adding other elements to the surveys.

map2(surveys$fishCatchSummaries, surveys$surveyDate, ~{ .x$survey_date <- .y ; .x }) %>% 
  map2(surveys$surveyType, ~{ .x$survey_type <- .y ; .x }) %>% 
  map2(surveys$surveySubType, ~{ .x$survey_subtype <- .y ; .x }) %>% 
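  # use .x here, not `.`: assigning to `.` edits a separate copy, so survey_id is dropped (as in the 12-column output below)
  map2_df(surveys$surveyID, ~{ .x$survey_id <- .y ; .x }) %>% 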
  as_tibble() %>% 
  type_convert() %>% 
  glimpse()
## Observations: 120
## Variables: 12
## $ quartileCount  <chr> "0.5-7.5", "0.7-4.2", "N/A", "0.4-2.2", "0.9-5.7", "1.5-7.3"...
## $ CPUE           <dbl> 25.0, 3.6, 4.0, 0.5, 5.0, 17.5, 6.5, 1.0, 0.8, 0.2, 190.0, 0...
## $ totalCatch     <int> 50, 18, 20, 1, 25, 35, 13, 2, 4, 1, 950, 1, 2, 4, 3, 13, 27,...
## $ species        <chr> "YEB", "PMK", "HSF", "WTS", "YEB", "NOP", "BLG", "BLC", "BLC...
## $ totalWeight    <dbl> 41.75, 2.30, 4.50, 3.50, 24.25, 146.25, 3.25, 0.60, 1.45, 2....
## $ quartileWeight <chr> "0.5-0.8", "0.1-0.2", "N/A", "1.5-2.4", "0.5-0.8", "2.0-3.5"...
## $ averageWeight  <dbl> 0.83, 0.13, 0.23, 3.50, 0.97, 4.18, 0.25, 0.30, 0.36, 2.50, ...
## $ gearCount      <int> 2, 5, 5, 2, 5, 2, 2, 2, 5, 5, 5, 2, 2, 2, 5, 2, 5, 5, 5, 2, ...
## $ gear           <chr> "Standard gill nets", "Standard trap nets", "Standard trap n...
## $ survey_date    <date> 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23,...
## $ survey_type    <chr> "Standard Survey", "Standard Survey", "Standard Survey", "St...
## $ survey_subtype <chr> "Population Assessment", "Population Assessment", "Populatio...

If you're not familiar with piping, it's just a way to avoid temporary variables.

tmp <- map2(surveys$fishCatchSummaries, surveys$surveyDate, ~{ .x$survey_date <- .y ; .x })
tmp <- map2(tmp, surveys$surveyType, ~{ .x$survey_type <- .y ; .x })
tmp <- map2(tmp, surveys$surveySubType, ~{ .x$survey_subtype <- .y ; .x })
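# use .x here, not `.`: assigning to `.` edits a separate copy, so survey_id is dropped (as in the 12-column output below)
tmp <- map2_df(tmp, surveys$surveyID, ~{ .x$survey_id <- .y ; .x })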
tmp <- as_tibble(tmp)
final_data <- type_convert(tmp)

glimpse(final_data)
## Observations: 120
## Variables: 12
## $ quartileCount  <chr> "0.5-7.5", "0.7-4.2", "N/A", "0.4-2.2", "0.9-5.7", "1.5-7.3"...
## $ CPUE           <dbl> 25.0, 3.6, 4.0, 0.5, 5.0, 17.5, 6.5, 1.0, 0.8, 0.2, 190.0, 0...
## $ totalCatch     <int> 50, 18, 20, 1, 25, 35, 13, 2, 4, 1, 950, 1, 2, 4, 3, 13, 27,...
## $ species        <chr> "YEB", "PMK", "HSF", "WTS", "YEB", "NOP", "BLG", "BLC", "BLC...
## $ totalWeight    <dbl> 41.75, 2.30, 4.50, 3.50, 24.25, 146.25, 3.25, 0.60, 1.45, 2....
## $ quartileWeight <chr> "0.5-0.8", "0.1-0.2", "N/A", "1.5-2.4", "0.5-0.8", "2.0-3.5"...
## $ averageWeight  <dbl> 0.83, 0.13, 0.23, 3.50, 0.97, 4.18, 0.25, 0.30, 0.36, 2.50, ...
## $ gearCount      <int> 2, 5, 5, 2, 5, 2, 2, 2, 5, 5, 5, 2, 2, 2, 5, 2, 5, 5, 5, 2, ...
## $ gear           <chr> "Standard gill nets", "Standard trap nets", "Standard trap n...
## $ survey_date    <date> 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23,...
## $ survey_type    <chr> "Standard Survey", "Standard Survey", "Standard Survey", "St...
## $ survey_subtype <chr> "Population Assessment", "Population Assessment", "Populatio...

final_data
## # A tibble: 120 x 12
##    quartileCount  CPUE totalCatch species totalWeight quartileWeight averageWeight gearCount               gear survey_date     survey_type        survey_subtype
##            <chr> <dbl>      <int>   <chr>       <dbl>          <chr>         <dbl>     <int>              <chr>      <date>           <chr>                 <chr>
##  1       0.5-7.5  25.0         50     YEB       41.75        0.5-0.8          0.83         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  2       0.7-4.2   3.6         18     PMK        2.30        0.1-0.2          0.13         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  3           N/A   4.0         20     HSF        4.50            N/A          0.23         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  4       0.4-2.2   0.5          1     WTS        3.50        1.5-2.4          3.50         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  5       0.9-5.7   5.0         25     YEB       24.25        0.5-0.8          0.97         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  6       1.5-7.3  17.5         35     NOP      146.25        2.0-3.5          4.18         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  7           N/A   6.5         13     BLG        3.25            N/A          0.25         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  8      2.5-16.5   1.0          2     BLC        0.60        0.1-0.3          0.30         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  9      1.8-21.2   0.8          4     BLC        1.45        0.2-0.3          0.36         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
## 10           N/A   0.2          1     NOP        2.50            N/A          2.50         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
## # ... with 110 more rows
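
As an alternative to the chain of map2() calls, tidyr::unnest() can expand a list column of data frames and repeat the other columns alongside it. This is a sketch on made-up data, not something taken from the answer above; toy mimics only the surveyDate and fishCatchSummaries columns, and whether it works unchanged on the real surveys data frame is an assumption.

library(tibble)
library(tidyr)

# made-up stand-in for two columns of the surveys data frame
toy <- tibble(
  surveyDate = c("1980-06-23", "1992-07-06"),
  fishCatchSummaries = list(
    data.frame(species = c("NOP", "BLG"), totalCatch = c(3L, 10L), stringsAsFactors = FALSE),
    data.frame(species = "YEB", totalCatch = 7L, stringsAsFactors = FALSE)
  )
)

# one output row per row of each nested data frame; surveyDate is repeated to match
flat <- unnest(toy, fishCatchSummaries)
flat

On the real data you would first select only the columns you want to carry along (for example the date, type and subtype) together with fishCatchSummaries, since the other list columns would get in the way.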