Yannis Yannis - 29 days ago 12
Python Question

Extract specific information from a log file using a script (Linux or Windows or Python)

I am interested in learning how to extract information


  • count occurrences of keywords,

  • get the timestamp for specific occurrences of keywords (note that the timestamps will always for the same day; usually, within a couple of hours of the same day),

  • get the elapsed time of specific log entries



from a text log file (
log.txt
) via a script (Linux bash or Windows batch or Python). All this information should be written in another text file (
results.txt
) or be printed on the terminal.

Basically, all the other log entries (i.e. with the
blah blah
are ignored).


For example for the following text log file, where each line starts with a timestamp followed by an empty space, a dash line(-) and one or more empty space(s) followed by keywords:


11:59:35.875 - action - WRITE(34) start

11:59:35.875 - blah blah

11:59:35.875 - blah blah

11:59:35.877 - blah blah

11:59:35.897 - KEYWORD_1

11:59:35.975 - action - WRITE(34) end

11:59:36.992 - KEYWORD_1

11:59:36.999 - KEYWORD_1

11:59:37.535 - blah blah

11:59:37.545 - ACTION_A - STATE: type 2

11:59:37.575 - blah blah

11:59:37.577 - blah blah

11:59:37.845 - KEYWORD_2

11:59:37.945 - ACTION_B result

11:59:37.950 - blah blah

11:59:38.075 - action - WRITE(22) start

11:59:38.075 - blah blah

11:59:38.085 - blah blah

11:59:38.097 - KEYWORD_2

11:59:39.975 - action - WRITE(22) end


Firstly, I would like to count the occurrences of each of the
KEYWORD_1
and the
KEYWORD_2
(e.g. 2 and 2, respectively).

Secondly, I want to be able to print the timestamps of each
KEYWORD
occurrence, e.g.
11:59:35.897
for the first occurrence of the
KEYWORD_1
.

Finally, find the elapsed time between two log entries:


  1. those who start with
    - action - WRITE(#) start
    and end with
    - action - WRITE(#) end
    where
    #
    is any integer number, e.g. 11:59:35.975 - 11:59:35.875 =
    1ms
    for the first
    WRITE(34)

  2. those who start with
    - ACTION_A ...
    and end with
    ACTION_B ...
    e.g. 11:59:37.545 - 11:59:37.945 =
    4ms
    for the first
    ACTION_A .. ACTION_B
    .



I have tried
find /c "KEYWORD_1" log.txt >results.txt
(Windows batch) to count the occurrences but I cannot extract the respective timestamp. For the other requirements I have no idea how to start as I have no experience with such actions before. Tried adapting the answers from this question for my needs with no success.

Any code fragment example or link with related resources will be much appreciated.

Answer
@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "filename1=%sourcedir%\q40441783.txt"
:: occurrence count for KEYWORD_1,KEYWORD_2,KEYWORD_3
FOR %%k IN (KEYWORD_1 KEYWORD_2 KEYWORD_3) DO (
 FOR /f "delims=" %%c IN ('type "%filename1%"^|FIND /c "%%k"') DO ECHO %%k : %%c times
)
:: Timestamp display for KEYWORD_1,KEYWORD_2,KEYWORD_3
FOR %%k IN (KEYWORD_1 KEYWORD_2 KEYWORD_3) DO (
 type "%filename1%"|FIND "%%k"
)
:: remove variables starting $ or #
For %%b IN ($ #) DO FOR  /F "delims==" %%a In ('set %%b 2^>Nul') DO SET "%%a="
:: action - WRITE(#)
FOR /f "usebackqtokens=1*delims=- " %%a IN ("%filename1%") DO (
 REM interested in "pattern - WRITE(#) start/end"
 FOR /f "tokens=1-3*delims=-()" %%A IN ("%%b") DO (
  REM %%A is action, %%B "WRITE" %%C # %%D " start"/" end"
  IF "%%B"==" WRITE" IF "%%D"==" start" SET "$%%C %%A$=%%a"
  IF "%%B"==" WRITE" IF "%%D"==" end" SET "#%%C %%A#=%%a"
  CALL :formatch&CALL :report "WRITE(%%C)"
 )
)
SET $ 2>nul
SET # 2>nul
:: remove variables starting $ or #
For %%b IN ($ #) DO FOR  /F "delims==" %%a In ('set %%b 2^>Nul') DO SET "%%a="

FOR /f "usebackqtokens=1*delims=- " %%a IN ("%filename1%") DO (
 REM interested in "ACTION_A/ACTION_B elapsed time"
 FOR /f "tokens=1*delims=- " %%A IN ("%%b") DO (
  REM %%A is action, %%B remainder of line
  IF "%%A"=="ACTION_A" SET "$1$=%%a"&SET "_actiona=%%B"
  IF "%%A"=="ACTION_B" SET "#1#=%%a"
  CALL :formatch
  CALL :report "%%_actiona%% %%B"
 )
)
SET $ 2>nul
SET # 2>nul

GOTO :EOF


:: See whether a $something$ AND a #something# exist and report if so
:formatch
SET "elapsed="
FOR /f "tokens=1,2delims=$=" %%m IN ('set $ 2^>nul') DO (
 IF DEFINED #%%m# (
  CALL :elapsed %%n %%#%%m#%%
  SET "#%%m#="
  SET "$%%m$="
 )
)
GOTO :eof

:report
IF DEFINED elapsed ECHO %elapsed% %~1
GOTO :eof

:: get %2 - %1 both in HH:mm.ss.ttt format
:elapsed
FOR /f "tokens=1-4delims=:." %%w IN ("%2") DO (SET /a hh=2%%w&SET /a mm=2%%x&SET /a ss=2%%y&SET /a ttt=2%%z)
FOR /f "tokens=1-4delims=:." %%w IN ("%1") DO (SET /a hh-=1%%w&SET /a mm-=1%%x&SET /a ss-=1%%y&SET /a ttt-=1%%z)
:: compensate for "negatives"
IF %ttt% lss 1000 set/a ttt+=1000&set/a ss-=1
IF %ss% lss 100 set/a ss+=60&set/a mm-=1
IF %mm% lss 100 set/a mm+=60&set/a hh-=1
IF %hh% lss 100 set/a hh+=24
SET "elapsed=%hh:~-2%:%mm:~-2%:%ss:~-2%.%ttt:~-3%"
GOTO :eof

GOTO :eof

You would need to change the setting of sourcedir to suit your circumstances.
I used a file named q40441783.txt containing your data for my testing.

Interesting exercise.

The first two steps are fairly obvious. I included KEYWORD_3 to ensure a correct report was produced for "not found". Note that you state 2 occurrences for KEYWORD_1. Actually, in your posted data there are 3.

The next step required some explanation. First thing I do is to ensure that there are no variables starting # or $.

Next- analyse each line, splitting first on the first - or Space and then processing the part beyond the first delimiter-sequence tokenising on -() so that the tokens are as described in the REM statement. We then set the variable $...$ or #...# to the time in %%a. ... here is the unique part of the log entry - the number and action. Then check whether there is both a $...$ and #...# for the same ... and if so, clear the $#...$# variables, calculate the elapsed time, reconstruct the line and report.

The elapsed-time calculation prepends 2 to the start of each of the variables to ensure that they don't start 0 and potentially get treated as octal. Then pull the same trick with subtracting the start time, using a prepended 1 to produce a result that should be 3 digits (4 for ms). If fewer digits are detected, then we need to add an appropriate number and deduct one from the next-higher time element.

The processing for the action_a/_b timing is essentially the same, but records the start/end times in $!$/#1# since there is no indication of the nature of the strings action_a and action_b and we're forced to assume that the appropriate events don't overlap.

Comments