C. Martin - 3 months ago

Question

Web-scraping error in R

I'm learning how to do web scraping in R and thought I'd try things out on a page with a built-in table. My ultimate goal is to have a data frame with four variables (Name, Party, Constituency, Link to individual webpage).

library(rvest)

url <- "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0"

# Read the listing page
constituency <- read_html(url)
print(constituency)

# Pull the text out of every table cell, then prepend the page URL to each string
constituency_red <- constituency %>% html_nodes('td') %>% html_text()
constituency_red <- paste0(url, constituency_red)
constituency_red <- unique(constituency_red)
constituency_red


The output I get after completing these steps looks like I'm going in the right direction. However, as the sample below shows, it's still a bit of a mess: the strings are full of embedded \r\n runs and padding. Any ideas on what I can do to clean this up?

[974] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n Poulter, Dr\r\n (Conservative)\r\n "
[975] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Central Suffolk and North Ipswich"
[976] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n Pound, Stephen\r\n (Labour)\r\n "
[977] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Ealing North"


After this I tried a second approach. The following code appears to give me a clean list of all the hyperlinks, so I'm wondering if this might be a potential workaround.

# Collect the href attribute from every link inside a table cell
constituency_links <- constituency %>% html_nodes("tr td a") %>% html_attr("href")
constituency_links <- paste0(url, constituency_links)
constituency_links <- unique(constituency_links)
constituency_links
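As an aside, paste0(url, constituency_links) will mangle any href that is already an absolute URL. A safer sketch (assuming the hrefs could be a mix of relative and absolute) resolves them with xml2::url_absolute(), which leaves absolute links untouched:

# Sketch: resolve each href against the page URL rather than pasting strings
library(xml2)

constituency_links <- constituency %>% html_nodes("tr td a") %>% html_attr("href")
constituency_links <- unique(url_absolute(constituency_links, url))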


My third and final try was to use the following code:

# Fetch every individual page in one go
all_constituency <- lapply(constituency_links, read_html)
all_constituency


When I run this, things slow down a lot and then I start getting:

Error in open.connection(x, "rb") : HTTP error 400.

So I tried running it as a loop instead.

all_constituency <- list()
for (i in constituency_links) {
  all_constituency[[i]] <- read_html(i)
}


I get the same error messages with this approach. Any suggestions on how to pull and clean this information would be much appreciated.

Answer

It's pretty straightforward: grab the name/link cells by their (generated) id attribute, pull the party out of each cell's text with a regex, and take the constituency from the second column of each row:

library(rvest)
library(stringi)
library(purrr)
library(dplyr)

pg <- read_html("http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0")

# The name/party cells all share a generated ASP.NET id prefix, so match on that
td_1 <- html_nodes(pg, xpath=".//td[contains(@id,'ctl00_ctl00_FormContent_SiteSpecificPlaceholder_PageContent_rptMembers_ctl')]")

data_frame(mp_name=html_text(html_nodes(td_1, "a")),
           href=html_attr(html_nodes(td_1, "a"), "href"),
           # the party is the parenthesized bit of each cell's text
           party=map_chr(stri_match_all_regex(html_text(td_1), "\\((.*)\\)"), 2),
           # the constituency sits in the second column of each row
           constituency=html_text(html_nodes(pg, xpath=".//tr/td[2]"))) -> df

glimpse(df)
## Observations: 649
## Variables: 4
## $ mp_name      <chr> "Abbott, Ms Diane", "Abrahams, Debbie", "Adams, N...
## $ href         <chr> "http://www.parliament.uk/biographies/commons/ms-...
## $ party        <chr> "Labour", "Labour", "Conservative", "Conservative...
## $ constituency <chr> "Hackney North and Stoke Newington", "Oldham East...
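As for the HTTP 400s in the question: notice in the output above that the href values are already absolute URLs, so paste0(url, constituency_links) produces malformed addresses, which is the likely cause of the error. Once you have df, a minimal sketch for fetching the individual pages (the Sys.sleep() pause and tryCatch() wrapper are just defensive polite-scraping habits, not anything this site specifically requires):

# Sketch: fetch each MP page with a pause between requests, and keep going
# past any page that errors out. Assumes df from above.
mp_pages <- map(df$href, function(u) {
  Sys.sleep(1)  # be polite to the server
  tryCatch(read_html(u), error = function(e) NULL)
})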