Hardik Hardik - 1 year ago 75
R Question

Unable to scrape news website

I am creating a dataset from the following RSS news feed.

I read the following fields from the XML:

  • title

  • title url

  • pub date

I then use each title URL to fetch the description (the synopsis below the main headline) by requesting each URL and scraping the page.

However, the description vector has length 197, while the other vectors have length 200.
Because of this mismatch, I am unable to create my data frame.

Can someone help me scrape this data efficiently?

The code below is reproducible:


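library(RCurl)  # getURL
library(XML)    # xmlParse, xpathApply, xpathSApply
library(rvest)  # read_html, html_nodes, html_text

url = "http://indianexpress.com/section/india/feed/"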

newstopics = getURL(url)

newsxml = xmlParse(newstopics)

title <- xpathApply(newsxml, "//item/title", xmlValue)
title <- unlist(title)

titleurl <- xpathSApply(newsxml, '//item/link', xmlValue)
pubdate <- xpathSApply(newsxml, '//item/pubDate', xmlValue)

t1 = Sys.time()
desc <- NULL

for (i in 1:length(titleurl)){

page = read_html(titleurl[i])
temp = html_text(html_nodes(page,'.synopsis'))
desc = c(desc,temp)

}

print(difftime(Sys.time(), t1, units = 'sec'))

desc = gsub("\n",' ',desc)

newsdata = data.frame(title,titleurl,desc,pubdate)

I get the following error:

Error in data.frame(title, titleurl, desc, pubdate) :
arguments imply differing number of rows: 200, 197

Answer Source

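The mismatch comes from html_nodes(), which returns an empty set for any page that has no .synopsis element, so c() adds nothing for that URL. html_node() (singular) returns exactly one result per page, and html_text() gives NA for a missing node, which keeps desc aligned with the other columns.

You can do the following: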


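library(tidyverse)  # tibble, purrr (map_df, map_chr), dplyr (mutate), %>%
library(xml2)       # read_xml, xml_find_all, xml_find_first, xml_text
library(rvest)      # read_html, html_node, html_text

feed <- read_xml("http://indianexpress.com/section/india/feed/")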

# helper function to extract information from the item node
item2vec <- function(item){
  tibble(title = xml_text(xml_find_first(item, "./title")),
         link = xml_text(xml_find_first(item, "./link")),
         pubDate = xml_text(xml_find_first(item, "./pubDate")))
}

dat <- feed %>% 
  xml_find_all("//item") %>% 
  map_df(item2vec)  # apply the helper to every item and row-bind the results

# The following takes a while
dat <- dat %>% 
  mutate(desc = map_chr(dat$link, ~read_html(.) %>% html_node('.synopsis') %>% html_text))

This gives you a data.frame/tibble with 4 columns:

> glimpse(dat)
Observations: 200
Variables: 4
$ title   <chr> "Common man has no problem with note ban, says Santosh Gangwar", "Bombay High Court comes...
$ link    <chr> "http://indianexpress.com/article/india/india-news-india/demonetisation-note-ban-cash-cru...
$ pubDate <chr> "Mon, 21 Nov 2016 20:04:21 +0000", "Mon, 21 Nov 2016 20:01:43 +0000", "Mon, 21 Nov 2016 1...
$ desc    <chr> "MoS for Finance speaks to Indian Express in Bareilly, his Lok Sabha constituency.", "The...
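
To see why the two approaches produce vectors of different lengths, here is a minimal offline sketch. The two pages are hypothetical stand-ins built from HTML strings (they are not taken from the real site), and the variable names are only for illustration; it assumes that articles lacking a .synopsis element are what caused the 197 vs. 200 mismatch.

```r
library(rvest)

# Hypothetical stand-in pages, built from strings so no network access is needed.
# Only the first one contains an element with class "synopsis".
pages <- list(
  read_html('<html><body><h2 class="synopsis">First summary</h2></body></html>'),
  read_html('<html><body><p>This page has no synopsis.</p></body></html>')
)

# Pattern from the question: html_nodes() yields an empty set for the second
# page, so c() appends nothing and the result ends up one element short.
desc_nodes <- NULL
for (i in seq_along(pages)) {
  desc_nodes <- c(desc_nodes, html_text(html_nodes(pages[[i]], ".synopsis")))
}

# Pattern from the answer: html_node() yields exactly one result per page,
# and html_text() turns a missing node into NA.
desc_node <- vapply(
  pages,
  function(page) html_text(html_node(page, ".synopsis")),
  character(1)
)

length(desc_nodes)  # 1
length(desc_node)   # 2
is.na(desc_node)    # FALSE TRUE
```

After scraping the real feed, something like sum(is.na(dat$desc)) should show how many articles had no synopsis, assuming the column is named desc as above.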