user1471980 - 1 month ago
R Question

How do you pick certain lines from a log file with R?

There is a log file like this, and it contains many entries:

2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3

Report Job information
Job ID : 12345
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:18:52.08
Report mode : Simple

2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3

Report Job information
Job ID : 12367
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General1
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:45:52.08
Report mode : Simple


I need to extract this part from the log and put it into a data frame. The log will have many entries like this; every time the script sees this set of entries in the log, it should append a row to the data frame.

Report Job information
Job ID : 12367
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General1
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:45:52.08
Report mode : Simple


I've tried this:

i<-c("app.log")
xx <- readLines(i)
ii <- 10 + grep("Report Job information",xx, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(xx[ii]," ")[[2]])


Any ideas how I could do this?

Given this data:

Job ID : 12345
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:18:52.08
Report mode : Simple


Each set of entries needs to be one row. For example:
title=Report Job information, jobid=12345, JobName=Background Execution, job priority=5, jobgroup=normal, reportgroup=General, ModifiedDate=2016-09-16 15:18:52.08 etc

Answer

Here is the data, read in as one line per element (note that this uses the magrittr pipe, %>%, and later bind_rows, both loaded with dplyr):

library(dplyr)

inData <-
"2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3

Report Job information
Job ID : 12345
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:18:52.08
Report mode : Simple

2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3

Report Job information
Job ID : 12367
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General1
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:45:52.08
Report mode : Simple" %>%
  strsplit("\\n") %>%
  unlist
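
With a real log you would read the file rather than paste its text. A minimal sketch, assuming the log is a plain text file (the question uses "app.log"); the temporary file here is only a hypothetical stand-in so the example runs anywhere:

```r
# Hypothetical stand-in for the real log: write two lines to a temp file.
# With the real file, this would simply be readLines("app.log").
logFile <- tempfile(fileext = ".log")
writeLines(c("Report Job information", "Job ID : 12345"), logFile)

# readLines returns one character element per line, the same shape
# that strsplit + unlist produce for the pasted text.
fileData <- readLines(logFile)
```

The result can be used in place of inData in the steps that follow.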

I think the easiest approach is to first find all of the starting positions. Here, I assume that "Report Job information" is at the top of each report (but that it does not need to be included in the data.frame, so I start one line after it).

reportStarts <-
  grep("Report Job information", inData) + 1

Next, I find where the reports end. Not knowing more about the structure, I am assuming that "Report mode" is the last thing reported. If, instead, it is a set number of lines, or some other break indicator, you may need to modify this step.

reportStops <-
  grep("Report mode", inData)
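
Since the start and stop positions are found independently, it may be worth checking that they pair up before looping, for example if a report is cut off at the end of the log. A small sketch of such a check; the input lines and the ex* and pairsOk names are made up for illustration:

```r
# Small illustrative input (not the full log): two reports, the second truncated.
exampleLines <- c("Report Job information", "Job ID : 1", "Report mode : Simple",
                  "Report Job information", "Job ID : 2")

exStarts <- grep("Report Job information", exampleLines, fixed = TRUE) + 1
exStops  <- grep("Report mode", exampleLines, fixed = TRUE)

# Each start should have exactly one stop, and that stop should not precede it.
# && short-circuits, so the comparison only runs when the lengths agree.
pairsOk <- length(exStarts) == length(exStops) &&
  all(exStarts <= exStops)
```

Here pairsOk is FALSE because the second report has no "Report mode" line; how to handle that case (skip it, or stop with an error) depends on the log.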

Next, I loop through each of the reports (using idx to indicate which report I am working on). The approach first limits inData to the lines in this report, then splits each line on " : ", which appears to be the field separator, and uses the part before the " : " to name the value (I re-join the remaining pieces with paste as a defense in case the value itself includes a " : "). Finally, I turn each report into a one-row data.frame, then combine the rows with bind_rows so that any extra columns present in only some reports are handled appropriately.
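
To see what the paste step protects against, here is the split applied to a single line. The line is hypothetical (no value in the sample log contains " : "), and parts, fieldName and fieldValue are names used only for this illustration:

```r
# Hypothetical line whose value itself contains " : ".
exampleLine <- "Job description : Brief : with detail"

# Splitting on the separator yields three pieces instead of two.
parts <- strsplit(exampleLine, " : ", fixed = TRUE)[[1]]

# The first piece is the field name; everything after it is the value,
# re-joined with the separator so nothing is lost.
fieldName  <- parts[1]
fieldValue <- paste(parts[-1], collapse = " : ")
```

Without the paste, only the second piece would be kept as the value. The full loop applies the same idea to every line of each report: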

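lapply(seq_along(reportStarts), function(idx){

  # Lines belonging to this report only
  inData[reportStarts[idx]:reportStops[idx]] %>%
    strsplit(" : ", fixed = TRUE) %>%
    sapply(function(thisField){
      # Value = everything after the first " : ", re-joined; name = the part before it
      setNames(paste(thisField[-1]
                     , collapse = " : ")
               , thisField[1])
    }) %>%
    rbind %>%
    data.frame(stringsAsFactors = FALSE)

}) %>%
  bind_rows()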

Outputs this data.frame:

  Job.ID             Job.name Job.priority Job.group Report.group Job.description   User     Report Modified.user          Modified.date Report.mode
1  12345 Background Execution            5    normal      General           Brief user01 General101            12 2016-09-16 15:18:52.08      Simple
2  12367 Background Execution            5    normal     General1           Brief user01 General101            12 2016-09-16 15:45:52.08      Simple
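
Because every value comes out of strsplit, the columns hold text rather than numbers or dates. If typed columns are needed, something like the following could be applied afterwards. This is a sketch on a hand-built stand-in for the result (the reports name is made up), and the UTC time zone is an assumption, since the log does not state one:

```r
# Stand-in for part of the result above, with every column stored as text.
reports <- data.frame(
  Job.ID        = c("12345", "12367"),
  Job.priority  = c("5", "5"),
  Modified.date = c("2016-09-16 15:18:52.08", "2016-09-16 15:45:52.08"),
  stringsAsFactors = FALSE
)

reports$Job.ID       <- as.integer(reports$Job.ID)
reports$Job.priority <- as.integer(reports$Job.priority)

# %OS reads seconds including a fractional part; tz = "UTC" is an assumption.
reports$Modified.date <- as.POSIXct(reports$Modified.date,
                                    format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")
```

If the columns turn out to be factors (possible on older versions of R), wrap them in as.character before converting, since as.integer on a factor returns the level codes.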