Ildar Gabdrakhmanov - 1 year ago
R Question

I have to grab plain text from over 290K webpages. Is there a way to improve the speed?

I have a vector with more than 290K URLs leading to articles on a news portal.
This is a sample:

tempUrls <- c("",

This is the code I use to download the plain text:

library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), getNodeSet(), xmlSApply(), xmlValue()

GetPageText <- function(address) {

  # fetch the raw HTML, following redirects, with a 10 second timeout
  webpage <- getURL(address, followLocation = TRUE, .opts = list(timeout = 10))
  pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE, encoding = "UTF-8")

  # article paragraphs, collapsed into a single string
  node <- getNodeSet(pagetree, "//div[@itemprop='articleBody']/..//p")
  plantext <- xmlSApply(node, xmlValue)
  plantext <- paste(plantext, collapse = "")

  # page title
  node <- getNodeSet(pagetree, "//title")
  title <- xmlSApply(node, xmlValue)

  return(list(plantext = plantext, title = title))

}

DownloadPlanText <- function() {

tempUrls <- c("",

for (i in 1:length(tempUrls)) {

Here are the system.time results for those 6 links:

user system elapsed
0.081 0.004 3.754
user system elapsed
0.061 0.003 3.340
user system elapsed
0.069 0.003 3.115
user system elapsed
0.059 0.003 3.697
user system elapsed
0.068 0.004 2.788
user system elapsed
0.061 0.004 3.469

So it takes about 3 seconds to download the plain text from one link. For 290K links that comes to 14,500 minutes, or about 241 hours, or about 10 days.
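The estimate can be checked with a few lines of R, assuming a flat 3 seconds per page (the six timings above average closer to 3.4 seconds, so the real figure would be somewhat higher):

```r
n_urls <- 290000       # number of links
secs_per_url <- 3      # assumed average, rounded down from the timings above

total_secs <- n_urls * secs_per_url
estimate <- c(minutes = total_secs / 60,
              hours   = total_secs / 3600,
              days    = total_secs / 86400)
print(round(estimate, 1))
```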

Is there any way to improve it?

Answer Source

There are a few ways you can do this, but I highly suggest keeping a copy of the source pages since you may need to go back and scrape and it's beyond rude to hammer a site again if you forget something.

One of the best ways to do this archiving is to create a WARC file. We can do that with wget. You can install wget on macOS with homebrew (brew install wget).

Create a file with the URLs to scrape, one URL per line. For example, this is the content of lenta.urls:
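If the URLs already live in an R vector, one way to produce such a file is writeLines(). This is only a minimal sketch: the example.com URLs and the temporary path are placeholders, not the real list.

```r
# Placeholder URLs standing in for the real vector of 290K article links.
urls <- c("https://example.com/news/1",
          "https://example.com/news/2",
          "https://example.com/news/3")

# Hypothetical output path; in practice this would be lenta.urls in the
# directory you run wget from.
url_file <- file.path(tempdir(), "lenta.urls")

# writeLines() puts each element on its own line, which is the
# one-URL-per-line layout described above.
writeLines(urls, url_file)

readLines(url_file)
```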

At the terminal, create a directory to hold the output and make it your working directory since wget is non-deterministically not removing temporary files (which is super annoying). In this new directory, again at the terminal, do:

wget --warc-file=lenta -i lenta.urls

That will go at the speed of your internet connection and retrieve the content of all of the pages in that file. It won't mirror (so it's not getting images, etc, just the content of the main page you wanted).

There may be many index.html[.###] files in this directory now due to that non-deterministic bug I mentioned. Before you delete them, make a backup of lenta.warc.gz since you just spent a lot of time getting it and also annoying the folks that run that site and you don't want to have to do it again. Seriously, copy this to a separate drive/file/etc. Once you make this backup (you made a backup, right?) you can and should delete those index.html[.###] files.

We now need to read this file and extract the content. However, the R creators seem to be incapable of making gz file connections work with seeking consistently across platforms even though there are a dozen C/C++ libraries that do it just fine, so you'll have to uncompress that lenta.warc.gz file (double click on it or do gunzip lenta.warc.gz in the terminal).
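If you would rather stay inside R for the uncompress step, sequential reading through a gzfile() connection is a possible alternative to gunzip. The helper name gunzip_file below is made up for this sketch, and I have only reasoned about it on a small single-stream file, not on the per-record compressed files that wget writes, so compare the output size against what gunzip produces before trusting it:

```r
# Copy a gzip-compressed file to an uncompressed one in fixed-size chunks,
# so the whole archive never has to fit in memory.
gunzip_file <- function(src, dest, chunk_size = 1024 * 1024) {
  inp <- gzfile(src, open = "rb")
  on.exit(close(inp), add = TRUE)
  out <- file(dest, open = "wb")
  on.exit(close(out), add = TRUE)

  repeat {
    chunk <- readBin(inp, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break   # end of the compressed stream
    writeBin(chunk, out)
  }
  invisible(dest)
}

# Toy demonstration on a small stand-in file (not a real WARC).
payload  <- charToRaw("WARC/1.0\r\nWARC-Type: response\r\n")
gz_path  <- tempfile(fileext = ".gz")
out_path <- tempfile()
con <- gzfile(gz_path, open = "wb")
writeBin(payload, con)
close(con)

gunzip_file(gz_path, out_path)
```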

Now that you have data to work with, here are some helper functions & libraries we'll need:


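library(stringi)  # stri_match_first_regex(), stri_split_fixed()
library(purrr)    # map_df()
library(rvest)    # read_html(), html_nodes(), html_text(), %>%

#' get the number of records in a warc request
warc_request_record_count <- function(warc_fle) {

  archive <- file(warc_fle, open="r")

  rec_count <- 0

  # read line by line and count the request record headers
  while (length(line <- readLines(archive, n=1, warn=FALSE)) > 0) {
    if (grepl("^WARC-Type: request", line)) {
      rec_count <- rec_count + 1
    }
  }

  close(archive)

  rec_count

}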
NOTE: the above function is needed since it's way more efficient to allocate the size of the list we're building to hold these records with a known value vs grow it dynamically, especially if you have those 200K+ sites to scrape.
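To make the difference concrete, here is a small sketch of the two patterns on a toy list (the size n and the stored values are arbitrary). Both produce the same result; the preallocated version just avoids extending the list on every iteration:

```r
n <- 1000

# Growing: the list is extended by one element on every iteration.
grown <- list()
for (i in seq_len(n)) {
  grown[[length(grown) + 1]] <- list(pos = i)
}

# Preallocated: the list is created at its final size and slots are filled in.
prealloc <- vector("list", n)
for (i in seq_len(n)) {
  prealloc[[i]] <- list(pos = i)
}

identical(grown, prealloc)
```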

#' create a warc record index of the responses so we can
#' seek right to them and slurp them up
warc_response_index <- function(warc_file,
                                record_count=warc_request_record_count(warc_file)) {

  records <- vector("list", record_count)
  archive <- file(warc_file, open="r")

  idx <- 0
  record <- list(url=NULL, pos=NULL, length=NULL)
  in_request <- FALSE

  while (length(line <- readLines(archive, n=1, warn=FALSE)) > 0) {

    # a new WARC record starts; store the previous response (if any)
    if (grepl("^WARC-Type:", line)) {
      if (grepl("response", line)) {
        if (idx > 0) {
          records[[idx]] <- record
          record <- list(url=NULL, pos=NULL, length=NULL)
        }
        in_request <- TRUE
        idx <- idx + 1
      } else {
        in_request <- FALSE
      }
    }

    if (in_request && grepl("^WARC-Target-URI:", line)) {
      record$url <- stri_match_first_regex(line, "^WARC-Target-URI: (.*)")[,2]
    }

    # remember the payload length and the current file position
    if (in_request && grepl("^Content-Length:", line)) {
      record$length <- as.numeric(stri_match_first_regex(line, "Content-Length: ([[:digit:]]+)")[,2])
      record$pos <- as.numeric(seek(archive, NA))
    }

  }

  close(archive)

  # store the last response record
  records[[idx]] <- record

  records

}

NOTE: That function provides the locations of the web site responses so we can then get to them super-fast.
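The seek-then-read idea can be sketched on a toy file, independent of the WARC helpers. Everything here (the file contents, the variable names) is made up for illustration, and the sketch opens the file in binary mode so byte offsets are exact, which is a choice of mine rather than what the helper functions in this answer do:

```r
# Build a toy file: some unrelated bytes, then one HTTP-style record.
prefix <- "junk-before\r\n"
record <- "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><title>t</title></html>"
toy <- tempfile()
writeBin(charToRaw(paste0(prefix, record)), toy)

# The "index entry": where the record starts and how long it is, in bytes.
pos <- nchar(prefix, type = "bytes")
len <- nchar(record, type = "bytes")

con <- file(toy, open = "rb")
invisible(seek(con, where = pos))   # jump straight to the record
raw_record <- readChar(con, nchars = len, useBytes = TRUE)
close(con)

# Split at the first blank line into HTTP header and page body.
sep    <- regexpr("\r\n\r\n", raw_record, fixed = TRUE)
header <- substr(raw_record, 1, sep - 1)
page   <- substr(raw_record, sep + 4, nchar(raw_record))
```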

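#' retrieve an individual response record
get_warc_response <- function(warc_file, pos, length) {

  archive <- file(warc_file, open="r")

  # jump to the start of the response and read the whole payload
  seek(archive, pos)
  record <- readChar(archive, length)

  # split the HTTP header from the page content at the first blank line
  record <- stri_split_fixed(record, "\r\n\r\n", 2)[[1]]
  names(record) <- c("header", "page")

  close(archive)

  # return a list so callers can use resp$header and resp$page
  as.list(record)

}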




Now, to slurp up all those pages, it's this simple:

warc_file <- "~/data/lenta.warc"

responses <- warc_response_index(warc_file)

Well, that just gets the location of all the pages in the WARC file. Here's how to get the content you need in a nice, tidy, data.frame:

map_df(responses, function(r) {

  resp <- get_warc_response(warc_file, r$pos, r$length)

  # the wget WARC response is sticking a numeric value as the first
  # line for URLs from this site (and it's not a byte-order-mark). so,
  # we need to strip that off before reading in the actual response.
  # i'm pretty sure it's the site injecting this and not wget since i
  # don't see it on other test URLs I ran through this for testing.

  pg <- read_html(stri_split_fixed(resp$page, "\r\n", 2)[[1]][2])

  html_nodes(pg, xpath=".//div[@itemprop='articleBody']/..//p") %>%
    html_text() %>%
    paste0(collapse="") -> plantext

  title <- html_text(html_nodes(pg, xpath=".//head/title"))

  data.frame(url=r$url, title, plantext, stringsAsFactors=FALSE)

}) -> df

And we can see if it worked:

dplyr::glimpse(df)

## Observations: 6
## Variables: 3
## $ url      <chr> "", "
## $ title    <chr> "Новым детским омбудсменом стал телеведущий Павел Астахов: Россия: Len...
## $ plantext <chr> "Президент РФ Дмитрий Медведев назначил нового уполномоченного по прав...

I'm sure others will have ideas for you (use GNU parallel with wget or curl at the command-line or use the parallel version of lapply with your existing code) but this process is ultimately friendlier to the web site provider and keeps a copy of the content locally for further processing. Plus, it's in an ISO standard format for web archives for which there are many, many tools for processing (soon to be a few in R, too).
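For the "parallel version of lapply" route, a rough sketch with the base parallel package might look like the following. fake_fetch is a stand-in that makes no network requests and the example.com URLs are placeholders; in practice you would pass your own fetching function and load its packages on the workers:

```r
library(parallel)

# Stand-in for a real page fetcher such as GetPageText(); it makes no
# network requests and just returns a list of the same shape.
fake_fetch <- function(address) {
  list(plantext = paste("text of", address),
       title    = paste("title of", address))
}

urls <- sprintf("https://example.com/news/%d", 1:8)  # placeholder URLs

cl <- makeCluster(2)   # two workers; keep this small to stay polite to the site
results <- parLapply(cl, urls, fake_fetch)
stopCluster(cl)

length(results)
```

More workers means more simultaneous requests against the same site, so the politeness concerns above apply even more strongly here.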

Using R for file seeking/slurping like this is terrible but my package to work with WARC files isn't ready yet. It's C++-backed so it's much faster/efficient, but it's beyond the scope of an SO answer to add that much inline C++ code just for this answer.

Even with this method I've put here, I'd divide the URLs into chunks and process them in batches to be nice to the site and to avoid having to re-scrape in the event your connection goes down in the middle of this.
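One way to do that chunking in R, as a minimal sketch: the batch size, the file naming scheme, and the example.com URLs are all arbitrary placeholders.

```r
urls <- sprintf("https://example.com/news/%d", 1:10)  # placeholder URLs
batch_size <- 4                                        # hypothetical batch size

# Assign each URL a batch number (1, 1, 1, 1, 2, 2, ...) and split on it.
batches <- split(urls, ceiling(seq_along(urls) / batch_size))

# One file per batch; each could be passed to a separate `wget -i` run.
batch_files <- file.path(tempdir(),
                         sprintf("lenta-%03d.urls", seq_along(batches)))
for (i in seq_along(batches)) {
  writeLines(batches[[i]], batch_files[i])
}

lengths(batches)
```

If a run dies partway through, only the batch that was in flight needs to be fetched again.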

Astute wgetters will ask why I'm not using the cdx option here. It was mostly to avoid complexity, and it's also kinda useless for the actual data processing since the R code has to seek to the records anyway. Using the cdx option (do a man wget to see what I'm referring to) would make it possible to restart interrupted WARC scrapes, but you have to be careful how you deal with it, so I just avoided the details of that for simplicity.

For the # of sites you have, look into the progress_estimated() function in dplyr and think about adding a progress bar to the map_df code.
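As a rough illustration of the idea, here is a sketch that uses base R's txtProgressBar() in a plain loop rather than dplyr's progress_estimated() inside map_df; the items vector and the squaring step are placeholders for the response index and the per-page work:

```r
items <- 1:20   # stand-in for the response index
pb <- txtProgressBar(min = 0, max = length(items), style = 3)

results <- vector("list", length(items))
for (i in seq_along(items)) {
  results[[i]] <- items[i]^2   # placeholder for the per-page work
  setTxtProgressBar(pb, i)     # advance the bar after each item
}
close(pb)
```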
