user1471980 - 1 month ago
R Question

How do you parse log files in R quickly?

There are big files in a directory called C:/logs. I need to read each log, grep for a few regular expressions, and put the output into a data frame. This is what I have, and it is taking a very long time to run:

final <- data.frame()
path <- c("C:/logs")
logs <- dir(path, pattern = "log", full.names = TRUE, ignore.case = TRUE)

toMatch <- c("compile.cancelSwapCode", "compile.insertCode", "compile.getCode",
             "compile.getCodes", "compile.getCodeWithAnalysis")

for (i in logs) {
  print(i)
  xx <- readLines(i)
  xxx <- grep(paste(toMatch, collapse = "|"), xx, value = TRUE)
  df <- read.table(text = xxx)
  final <- rbind(final, df)
}


The actual file sizes are 150 MB. Is there a quicker way to do this?

example log file:

2016-11-02 00:00:01,506 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - Received request for test request: [ticket=test101]
2016-11-02 00:00:01,514 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - request: [ticket=test101] found in cache, returning from Cache
2016-11-02 00:00:01,515 DEBUG [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - compileController.Code finished in 9ms
2016-11-02 00:00:01,578 INFO [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - Received request for test request: [ticket=test101]
2016-11-02 00:00:01,582 INFO [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - request: [ticket=test101] found in cache, returning from Cache
2016-11-02 00:00:01,582 DEBUG [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - compileController.Code finished in 4ms
2016-11-02 00:00:08,606 INFO [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - Received request for test request: [ticket=test102]
2016-11-02 00:00:08,606 INFO [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - request: [ticket=test102] found in cache, returning from Cache
2016-11-02 00:00:08,606 DEBUG [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - compileController.Code finished in 0ms
2016-11-02 00:00:09,320 INFO [[(JSK) mux request dispatch][/][tid=566][compileController.Code]] - Received request for test request: [ticket=test102]
2016-11-02 00:00:09,320 INFO [[(JSK) mux request dispatch][/][tid=566][compileController.Code]] - request: [ticket=test102] found in cache, returning from Cache

Answer

The major bottleneck in your code seems to be the call to read.table. By first concatenating the vector xxx into a single, newline-delimited string, you can pass it to data.table's fread, which is much faster. Additionally, I've replaced the call to grep with stri_subset from the stringi package, and used data.table's rbindlist to combine everything rather than iteratively calling rbind. I'm not sure how many files you have in total, but in the code below I'm using nine 180 MB files created from your sample.

library(data.table)
library(stringi)

files <- list.files(
    "/tmp",
    pattern = "logfile",
    full.names = TRUE
)
re <- paste(
    c("compile.cancelSwapCode", "compile.insertCode", 
      "compile.getCode", "compile.getCodes",
      "compile.getCodeWithAnalysis", "DEBUG"),
    collapse = "|"
)

system.time({

final <- rbindlist(
    lapply(files, function(f) {
        lines <- readLines(f)
        fread(
            paste0(
                stri_subset(lines, regex = re),
                collapse = "\n"
            ),
            header = FALSE,
            sep = " ",
            colClasses = "character"
        )
    })
)

})
#    user  system elapsed 
# 136.664   0.628 137.378
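
As an aside, the cost of growing a data frame with rbind in a loop can be seen in isolation with a toy benchmark (the names and sizes here are arbitrary):

```r
library(data.table)

# 1,000 small data frames to combine
dfs <- replicate(1000, data.frame(x = runif(10)), simplify = FALSE)

# Quadratic behaviour: each rbind copies everything accumulated so far
system.time({
    out1 <- data.frame()
    for (d in dfs) out1 <- rbind(out1, d)
})

# rbindlist allocates the result once
system.time(out2 <- rbindlist(dfs))
```

The second timing should be a small fraction of the first, and the gap widens as the number of pieces grows.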

It looks like you are on Windows, so if you have Rtools installed and are able to leverage Rcpp, you can shave off some more time by replacing readLines with a dead simple C++ equivalent:

#include <Rcpp.h>
#include <fstream>
#include <string>
#include <vector>

// [[Rcpp::export]]
Rcpp::CharacterVector read_lines(const char* file) {
    std::vector<std::string> res;
    res.reserve(10000);

    std::ifstream stream(file);
    std::string tmp;

    while (std::getline(stream, tmp)) {
        res.push_back(tmp);
    }

    return Rcpp::wrap(res);
}
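
For completeness, here is one way to compile and load the function above from an R session. This assumes the snippet is saved as read_lines.cpp (a file name chosen here for illustration) and that Rtools is installed:

```r
library(Rcpp)

# Compiles the C++ file and exports read_lines() into the session
sourceCpp("read_lines.cpp")

# Quick sanity check against base R on a throwaway file
tmp <- tempfile()
writeLines(c("line one", "line two"), tmp)
identical(read_lines(tmp), readLines(tmp))  # should be TRUE
```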

Replacing readLines with read_lines in the above, I get

#   user  system elapsed 
# 39.200   0.376  39.596 

If you aren't able to get up and running with Rtools and Rcpp, using read_lines from the readr package will still be faster than readLines. Here is a comparison of all three, reading in those nine 180 MB files:

system.time({ lapply(files, readLines) })
#    user  system elapsed 
# 287.136   1.140 288.471 

system.time({ lapply(files, readr::read_lines) })
#   user  system elapsed 
# 91.568   0.604  91.895

system.time({ lapply(files, read_lines) })
#   user  system elapsed 
# 24.204   0.652  24.862 

Update

Regarding your comment below: if I understand correctly, you can do something like this:

final <- rbindlist(
    lapply(files, function(f) {
        lines <- readLines(f)
        matches <- stri_subset(lines, regex = re)
        data.table(
            stri_match_first(
                matches,
                regex = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
            )
        )
    })
)[, .(Count = .N), by = "V1"]
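
The result carries the timestamp in an auto-named column V1; if you want a friendlier name, data.table's setnames modifies it in place. A self-contained sketch with made-up values:

```r
library(data.table)

# Stand-in for the aggregated result above: two rows share a timestamp
counts <- data.table(V1 = c("2016-11-02 00:00:01", "2016-11-02 00:00:01",
                            "2016-11-02 00:00:08"))[, .(Count = .N), by = "V1"]

# Rename the auto-generated column in place
setnames(counts, "V1", "Timestamp")
counts$Count  # 2 and 1, in order of first appearance
```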