Jordan Jordan - 1 year ago
R Question

Combining data frames while web scraping using rvest

I have the following code:

library('rvest')
library('tidyverse')
test_url <- c('http://www.citact.org/senator-brent-waltz-r-greenwood-district-36',
              'http://www.citact.org/senator-ron-grooms-r-new-albany-district-46',
              'http://www.citact.org/representative-mike-speedy-r-indianapolis-district-90')

test <- lapply(test_url, function(i){
  web <- read_html(i)
  grades <- html_nodes(web, 'td')
  test_grades <- data.frame(one = (as.data.frame(html_text(grades), two = 'idk')))

  first <- as.data.frame(test_grades[2:11, ])
  second <- as.data.frame(test_grades[13:22, ])

  names(test_grades) <- names(test_grades)
  testing <- data.frame(c(first, second))
})

test_names <- lapply(test_url, function(i){
  web <- read_html(i)
  info <- html_nodes(web, 'h3')
  text_info <- html_text(info)

  names_test_df <- data_frame(member = text_info)
  names_test_df <- separate(names_test_df, col = member,
                            c('Useless', 'Info'), sep = ': ')
  names_test_df <- separate(names_test_df, col = Info,
                            c('names', 'District'), sep = ',')
  names_test_df <- separate(names_test_df, col = names,
                            c('Position', 'First', 'Last', 'Party'), sep = ' ')
  names_test_df <- separate(names_test_df, col = Party,
                            c('Party', 'District Name'), sep = '-')
})

# cbind.fill comes from the rowr package
y <- do.call(cbind.fill, c(list(do.call(rbind, test)), do.call(rbind, test_names)))


This works in the sense that all of the information is gathered and there are no errors, but the issue lies with my final data frame, which I have called y. The data frame test and the data frame test_names do not match up when I create y: for example, some of the grades and years from test do not correspond to the correct candidates in test_names. Is there a way to make sure these line up correctly? I tried combining the data frames prior to looping, but I was unsuccessful. There might be a better way; that was just my initial plan.

Answer Source

Try to avoid reshaping, binding, and do.call in every second line; it makes your code hard to read and therefore hard to debug. I went ahead and made it a bit simpler.

person_xpath <- "//h1[contains(@class, 'title gutter')]"

url_tables <- lapply(test_url, function(x){
  # Scraping the information (note, connecting to URL once)
  page           <- read_html(x) 
  vote_outcome   <- t(html_table(page)[[1]])
  personal_info  <- html_nodes(page, xpath = person_xpath)
  personal_info  <- html_text(personal_info)

  # Remove some useless characters and split the string
  personal_info  <- gsub("\\(|\\) |,", "", personal_info)
  split_person   <- strsplit(personal_info, " ")[[1]]

  # Prepare the personal info for a cbind:
  dupe_info      <- sapply(split_person, rep, nrow(vote_outcome))
  if(ncol(dupe_info) == 7){
    dupe_info[,4] <- paste(dupe_info[,4], dupe_info[,5])
    dupe_info     <- dupe_info[,-5]
  }
  df             <- cbind(dupe_info, vote_outcome)[,-5]
  colnames(df)   <- c("Position", "First", "Last", "Party", 
                      "District", "Year", "Outcome")
  return(df)
})

url_tables[[1]]
#     Position  First   Last    Party         District Year    
# X1  "Senator" "Brent" "Waltz" "R-Greenwood" "36"     " "    
# X2  "Senator" "Brent" "Waltz" "R-Greenwood" "36"     "2008" 
# X3  "Senator" "Brent" "Waltz" "R-Greenwood" "36"     "2009" 
# X4  "Senator" "Brent" "Waltz" "R-Greenwood" "36"     "2010" 
# ..     ...      ...     ...       ...        ...       ...

# X1  "Pro-Consumer Voting Percentage"
# X2  "0%"                            
# X3  "57%"                           
# X4  "66%"                           
# ..   ...
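Since each element of `url_tables` already carries the legislator's personal info cbind-ed onto every vote row, the per-URL tables can simply be stacked row-wise to get the single combined data frame the question was after; a minimal sketch, assuming `url_tables` from above:

```r
# Stack the per-legislator matrices into one data frame.
# Because name/party/district were attached to each row *before*
# stacking, grades and years can no longer drift out of alignment
# with the wrong candidate.
final_df <- as.data.frame(do.call(rbind, url_tables),
                          stringsAsFactors = FALSE)
rownames(final_df) <- NULL
head(final_df)
```

This replaces the fragile `cbind.fill` step entirely: rows correspond by construction, not by position in two separately built lists.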