rrodrigorn0 - 4 months ago
R Question

R: Saving inside a lapply

I need to write a file from inside an lapply call. I'm scraping a large list of web pages and would like to save the output every 100 pages or so. I use the following code:

from = seq(1, 100, 10)
aa <- length(url)
func1 = function(url){
  out <- tryCatch(
    {
      aa <<- aa - 1
      print(aa)
      doc = htmlParse(url)
      address = as.data.frame(xpathSApply(doc, '//div[@class="panel-body"]', xmlValue, encoding="UTF-8"))
      page = cbind(address, url)

      if (aa %in% from){
        pg = suppressMessages(melt(cc))  # cc is referenced here, before lapply has returned
        write.csv(pg, paste("bcc_", aa, ".csv"))
      }

    }
    # (handlers not shown in the post)
  )
}
cc = lapply(url, func1)


However, when I run this I get an error saying that object "cc" is not found. I know this can be done with a for loop, but is there a way to accomplish it with the apply family of functions?

Answer

Build cc as an object in a new environment, outside of your lapply.

e <- new.env()
e$cc <- list()
a <- letters[]
b <- 1:26
# Example lapply
out <- lapply(a, function(a,b){ 
  e$cc[[a]] <- b
  if(length(e$cc)%%10==0) print(length(e$cc))
  b # Giving an output to out as well
  },b
)
# [1] 10
# [1] 20
# Showing first elements of outputs
# > e$cc
#$a
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#[26] 26
# > out
#[[1]]
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#[26] 26

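This builds cc inside a new R environment. Because environments are modified in place, cc can be inspected (and saved) while lapply is still running, and lapply still returns its usual output. In your original code, cc is only assigned after lapply has finished, which is why it is not found inside func1. It is not the most elegant solution, though.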
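To see why the environment matters, here is a minimal sketch (the names local_count and counter_env are made up for illustration). A plain <- inside the function only creates a local variable that disappears after each call, whereas an environment is modified in place, so every call sees what earlier calls stored.

```r
# Plain assignment inside the function: each call works on its own local copy.
local_count <- 0
invisible(lapply(1:5, function(i) {
  local_count <- local_count + 1  # changes a local variable only
}))
local_count
# [1] 0

# Assignment into an environment: the change persists across calls.
counter_env <- new.env()
counter_env$n <- 0
invisible(lapply(1:5, function(i) {
  counter_env$n <- counter_env$n + 1
}))
counter_env$n
# [1] 5
```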

N.B. This solution will need to be adapted to your code. Also reset e$cc with e$cc <- list() before re-running if need be: after the first run the names already exist, so a second run only replaces elements and length(e$cc) no longer grows.
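A small sketch of why the reset matters, assuming the same length-based trigger as above (run_once and e$hits are hypothetical names used only to record when a save would have happened):

```r
e <- new.env()
e$cc <- list()
e$hits <- integer(0)  # lengths at which a save would have been triggered

run_once <- function() {
  invisible(lapply(letters, function(key) {
    e$cc[[key]] <- 1
    if (length(e$cc) %% 10 == 0) e$hits <- c(e$hits, length(e$cc))
  }))
}

run_once()
e$hits
# [1] 10 20

run_once()  # the names already exist, so length(e$cc) stays at 26
e$hits
# [1] 10 20

e$cc <- list()  # reset before running again
run_once()
e$hits
# [1] 10 20 10 20
```

Without the reset, the second run never hits a multiple of 10, so nothing would be saved.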

ALTERNATIVELY (UNTESTED!): you could try adapting your script into something like this.

library(XML)       # htmlParse, xpathSApply, xmlValue
library(reshape2)  # melt (assumed to be the reshape2 version)

func1 <- function(url){
  out <- tryCatch(
    {
      doc <- htmlParse(url)
      address <- as.data.frame(xpathSApply(
                   doc, '//div[@class="panel-body"]', xmlValue, encoding="UTF-8")
                 )
      page <- cbind(address, url)
    }
    # add error/warning handlers here as needed
  )
  out
}
wrapfun <- function(urls){
  e <- new.env()
  e$cc <- list()
  lapply(urls, function(x){
    e$cc[[x]] <- func1(x)
    # Change the modulus to set how often you save,
    # e.g. length(e$cc) %% 100 == 0 would save every 100 pages.
    if(length(e$cc) %% 10 == 0){
      pg <- suppressMessages(melt(e$cc))
      write.csv(pg, paste0("bcc_", length(e$cc), ".csv")) # paste0 avoids spaces in the file name
    }
  })
  return(e$cc)
}
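
Since the script above is untested and needs network access, here is a self-contained sketch of the same checkpoint pattern that can be run offline. Everything in it is an assumption for illustration: fake_scrape stands in for the real htmlParse/xpathSApply step, scrape_with_checkpoints, save_every and out_dir are made-up names, and rows are combined with do.call(rbind, ...) instead of melt to stay in base R. It triggers on the index i rather than on length(e$cc), so failed pages do not shift the save points.

```r
# Stand-in for the real scraper: returns a one-row data frame per URL.
fake_scrape <- function(u) {
  data.frame(url = u, n_chars = nchar(u), stringsAsFactors = FALSE)
}

scrape_with_checkpoints <- function(urls, save_every = 10, out_dir = tempdir()) {
  e <- new.env()
  e$results <- vector("list", length(urls))
  e$files <- character(0)
  invisible(lapply(seq_along(urls), function(i) {
    # A page that fails is stored as NULL instead of stopping the whole run.
    res <- tryCatch(fake_scrape(urls[[i]]), error = function(err) NULL)
    e$results[i] <- list(res)  # keeps slot i even when res is NULL
    if (i %% save_every == 0) {
      # Drop failed pages, then stack everything scraped so far.
      done <- do.call(rbind, Filter(Negate(is.null), e$results[seq_len(i)]))
      path <- file.path(out_dir, paste0("bcc_", i, ".csv"))
      write.csv(done, path, row.names = FALSE)
      e$files <- c(e$files, path)
    }
  }))
  list(results = e$results, files = e$files)
}

urls <- paste0("http://example.com/page", 1:25)
res <- scrape_with_checkpoints(urls, save_every = 10)
basename(res$files)
# [1] "bcc_10.csv" "bcc_20.csv"
nrow(read.csv(res$files[2]))
# [1] 20
```

Each checkpoint file contains everything scraped up to that point, so only the latest file is needed to resume after a crash.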