Yannis Yannis - 1 year ago 54
Python Question

Extract specific information from a log file using a script (Linux or Windows or Python)

I am interested in learning how to extract information

  • count occurrences of keywords,

  • get the timestamp for specific occurrences of keywords (note that the timestamps will always for the same day; usually, within a couple of hours of the same day),

  • get the elapsed time of specific log entries

from a text log file (
) via a script (Linux bash or Windows batch or Python). All this information should be written in another text file (
) or be printed on the terminal.

Basically, all the other log entries (i.e. with the
blah blah
are ignored).

For example for the following text log file, where each line starts with a timestamp followed by an empty space, a dash line(-) and one or more empty space(s) followed by keywords:

11:59:35.875 - action - WRITE(34) start

11:59:35.875 - blah blah

11:59:35.875 - blah blah

11:59:35.877 - blah blah

11:59:35.897 - KEYWORD_1

11:59:35.975 - action - WRITE(34) end

11:59:36.992 - KEYWORD_1

11:59:36.999 - KEYWORD_1

11:59:37.535 - blah blah

11:59:37.545 - ACTION_A - STATE: type 2

11:59:37.575 - blah blah

11:59:37.577 - blah blah

11:59:37.845 - KEYWORD_2

11:59:37.945 - ACTION_B result

11:59:37.950 - blah blah

11:59:38.075 - action - WRITE(22) start

11:59:38.075 - blah blah

11:59:38.085 - blah blah

11:59:38.097 - KEYWORD_2

11:59:39.975 - action - WRITE(22) end

Firstly, I would like to count the occurrences of each of the
and the
(e.g. 2 and 2, respectively).

Secondly, I want to be able to print the timestamps of each
occurrence, e.g.
for the first occurrence of the

Finally, find the elapsed time between two log entries:

  1. those who start with
    - action - WRITE(#) start
    and end with
    - action - WRITE(#) end
    is any integer number, e.g. 11:59:35.975 - 11:59:35.875 =
    for the first

  2. those who start with
    - ACTION_A ...
    and end with
    ACTION_B ...
    e.g. 11:59:37.545 - 11:59:37.945 =
    for the first

I have tried
find /c "KEYWORD_1" log.txt >results.txt
(Windows batch) to count the occurrences but I cannot extract the respective timestamp. For the other requirements I have no idea how to start as I have no experience with such actions before. Tried adapting the answers from this question for my needs with no success.

Any code fragment example or link with related resources will be much appreciated.

Answer Source
SET "sourcedir=U:\sourcedir"
SET "filename1=%sourcedir%\q40441783.txt"
:: occurrence count for KEYWORD_1,KEYWORD_2,KEYWORD_3
 FOR /f "delims=" %%c IN ('type "%filename1%"^|FIND /c "%%k"') DO ECHO %%k : %%c times
:: Timestamp display for KEYWORD_1,KEYWORD_2,KEYWORD_3
 type "%filename1%"|FIND "%%k"
:: remove variables starting $ or #
For %%b IN ($ #) DO FOR  /F "delims==" %%a In ('set %%b 2^>Nul') DO SET "%%a="
:: action - WRITE(#)
FOR /f "usebackqtokens=1*delims=- " %%a IN ("%filename1%") DO (
 REM interested in "pattern - WRITE(#) start/end"
 FOR /f "tokens=1-3*delims=-()" %%A IN ("%%b") DO (
  REM %%A is action, %%B "WRITE" %%C # %%D " start"/" end"
  IF "%%B"==" WRITE" IF "%%D"==" start" SET "$%%C %%A$=%%a"
  IF "%%B"==" WRITE" IF "%%D"==" end" SET "#%%C %%A#=%%a"
  CALL :formatch&CALL :report "WRITE(%%C)"
SET $ 2>nul
SET # 2>nul
:: remove variables starting $ or #
For %%b IN ($ #) DO FOR  /F "delims==" %%a In ('set %%b 2^>Nul') DO SET "%%a="

FOR /f "usebackqtokens=1*delims=- " %%a IN ("%filename1%") DO (
 REM interested in "ACTION_A/ACTION_B elapsed time"
 FOR /f "tokens=1*delims=- " %%A IN ("%%b") DO (
  REM %%A is action, %%B remainder of line
  IF "%%A"=="ACTION_A" SET "$1$=%%a"&SET "_actiona=%%B"
  IF "%%A"=="ACTION_B" SET "#1#=%%a"
  CALL :formatch
  CALL :report "%%_actiona%% %%B"
SET $ 2>nul
SET # 2>nul


:: See whether a $something$ AND a #something# exist and report if so
SET "elapsed="
FOR /f "tokens=1,2delims=$=" %%m IN ('set $ 2^>nul') DO (
 IF DEFINED #%%m# (
  CALL :elapsed %%n %%#%%m#%%
  SET "#%%m#="
  SET "$%%m$="
GOTO :eof

IF DEFINED elapsed ECHO %elapsed% %~1
GOTO :eof

:: get %2 - %1 both in HH:mm.ss.ttt format
FOR /f "tokens=1-4delims=:." %%w IN ("%2") DO (SET /a hh=2%%w&SET /a mm=2%%x&SET /a ss=2%%y&SET /a ttt=2%%z)
FOR /f "tokens=1-4delims=:." %%w IN ("%1") DO (SET /a hh-=1%%w&SET /a mm-=1%%x&SET /a ss-=1%%y&SET /a ttt-=1%%z)
:: compensate for "negatives"
IF %ttt% lss 1000 set/a ttt+=1000&set/a ss-=1
IF %ss% lss 100 set/a ss+=60&set/a mm-=1
IF %mm% lss 100 set/a mm+=60&set/a hh-=1
IF %hh% lss 100 set/a hh+=24
SET "elapsed=%hh:~-2%:%mm:~-2%:%ss:~-2%.%ttt:~-3%"
GOTO :eof

GOTO :eof

You would need to change the setting of sourcedir to suit your circumstances.
I used a file named q40441783.txt containing your data for my testing.

Interesting exercise.

The first two steps are fairly obvious. I included KEYWORD_3 to ensure a correct report was produced for "not found". Note that you state 2 occurrences for KEYWORD_1. Actually, in your posted data there are 3.

The next step required some explanation. First thing I do is to ensure that there are no variables starting # or $.

Next- analyse each line, splitting first on the first - or Space and then processing the part beyond the first delimiter-sequence tokenising on -() so that the tokens are as described in the REM statement. We then set the variable $...$ or #...# to the time in %%a. ... here is the unique part of the log entry - the number and action. Then check whether there is both a $...$ and #...# for the same ... and if so, clear the $#...$# variables, calculate the elapsed time, reconstruct the line and report.

The elapsed-time calculation prepends 2 to the start of each of the variables to ensure that they don't start 0 and potentially get treated as octal. Then pull the same trick with subtracting the start time, using a prepended 1 to produce a result that should be 3 digits (4 for ms). If fewer digits are detected, then we need to add an appropriate number and deduct one from the next-higher time element.

The processing for the action_a/_b timing is essentially the same, but records the start/end times in $!$/#1# since there is no indication of the nature of the strings action_a and action_b and we're forced to assume that the appropriate events don't overlap.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download