
Reading a list of web pages in R and saving the output to CSV

I have a list of 40,000 web-page addresses in a CSV file. I want to read these pages into a new CSV file such that each cell of the CSV holds the text content of the corresponding web page.
I am able to read and parse a single web page using the following code:

library(XML)

# Read and parse an HTML page ('' is a placeholder for the page URL)
doc.html = htmlTreeParse('', useInternalNodes = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)

# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')


Is it possible to use the CSV holding the web-page addresses as input and get a new file with all the content, as described above?

Answer

You could try the following code. It should work for your purposes, but it is untested since I do not know which websites you mean to look at:

library(XML)

df <- read.csv("Webpage_urls.csv", stringsAsFactors = F)

webpage_parser <- function(x){
  doc.html = htmlTreeParse(x, useInternalNodes = TRUE)
  # Extract all the paragraphs (HTML tag is p, starting at
  # the root of the document). Unlist flattens the list to
  # create a character vector.
  doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

  # Replace all \n by spaces
  doc.text = gsub('\\n', ' ', doc.text)
  # Join all the elements of the character vector into a single
  # character string, separated by spaces
  doc.text = paste(doc.text, collapse = ' ')
  # Return the collapsed text explicitly
  doc.text
}

# Loop over the individual URLs, assumed to be in the first column of df.
# (lapply(df, ...) would iterate over the columns of the data frame, not the URLs.)
all_webpages <- lapply(df[[1]], webpage_parser)

Pages <- do.call(rbind, all_webpages)

Parsed_pages <- cbind(df, Pages)

write.csv(Parsed_pages, "All_parsed_pages.csv", row.names = F)
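
With 40,000 pages, a single unreachable or malformed URL would abort the whole lapply call. As a minimal sketch (reusing webpage_parser from above; the NA fallback is my own choice, not something from the question), you could wrap the parser in tryCatch so a failing page is recorded as NA instead of stopping the run:

library(XML)

# Wrapper that returns NA for any page that cannot be fetched or parsed
safe_parser <- function(x){
  tryCatch(webpage_parser(x), error = function(e) NA_character_)
}

all_webpages <- lapply(df[[1]], safe_parser)

You can then find the pages that failed with which(is.na(all_webpages)) and re-run just those URLs later.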