Lee Lee - 5 months ago 15
Bash Question

Slice keywords from log text files

I have a big log file with lines such as

[2016-06-03T10:03:12] No data: TW.WA2
[2016-06-03T11:03:02] wrong overlaps: XW.W12.HHZ.2007.289
[2016-06-03T14:05:26] failed to correct YP.CT02.HHZ.2012.334 because No matching response.


Each line consists of a timestamp, a reason for the logging, and a keyword composed of substrings joined by dots (TW.WA2, XW.W12.HHZ.2007.289 and YP.CT02.HHZ.2012.334 in the examples above).

The keyword format for a given message type is fixed (the substrings are joined by a fixed number of dots).

The substrings consist of letters and digits and are 0-5 characters long. A substring can be empty, but generally at most one per keyword, e.g., XW.WTA12..2007.289.

I want to


  • extract the keywords

  • save the unique keywords of each type to a separate file



So far I have tried grep, but that only does the classification:


  • grep "wrong overlaps" logfile > wrong_overlaps

  • grep "failed to correct" logfile > no_resp

  • grep "No data" logfile > no_data



In no_data, the expected contents look like

AW.AA1
TW.WA2
TW.WA3
...


In no_resp, the expected contents look like

XP..HHZ.2002.334
YP.CT01.HHZ.2012.330
YP.CT02.HHZ.2012.334
...


However, the simple grep commands above save the full lines. I guess I need a regex to extract the keywords?

Answer

Assuming a keyword is two or more runs of word characters (letters, digits, underscore) joined by periods, the following regex matches the keywords in the sample lines:

% grep -oE '\w+(\.\w+)+' data
TW.WA2
XW.W12.HHZ.2007.289
YP.CT02.HHZ.2012.334

-o prints only the matching part of each line, and -E enables extended regular expressions.
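The question mentions that one substring may be empty (as in XP..HHZ.2002.334). The pattern above requires at least one character between dots, so for such a keyword it would only print the part after the double dot. A minimal sketch of a variant that tolerates empty middle substrings, assuming the first and last substrings are never empty (the question does not say this explicitly); sample_log and keyword_re are names made up for this illustration:

```shell
# Hypothetical sample lines modelled on the question's log format.
sample_log() {
  printf '%s\n' \
    '[2016-06-03T10:03:12] No data: TW.WA2' \
    '[2016-06-03T14:05:26] failed to correct XP..HHZ.2002.334 because No matching response.'
}

# First and last substrings must be non-empty; the ones in between may be empty.
# Requiring a non-empty last substring keeps "response." from being matched.
keyword_re='[[:alnum:]]+(\.[[:alnum:]]*)*\.[[:alnum:]]+'

sample_log | grep -oE "$keyword_re"
```

With these two sample lines this should print TW.WA2 and XP..HHZ.2002.334. The bracket expression [[:alnum:]] is used instead of \w so that underscores are not accepted.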

On its own, however, this does not let you split the output into multiple files, e.g. creating a file wrong_overlaps that contains only the keywords from lines with wrong overlaps.
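One way around this without lookbehinds is to keep the classification grep from the question and pipe its output into the extraction grep, then deduplicate with sort -u. This is only a sketch: the helper name extract_keywords is made up here, and it assumes the keyword is the only dotted token on each line, which holds for the sample lines shown.

```shell
# extract_keywords REASON: read log lines on stdin, keep those containing REASON,
# print only the dotted keyword from each, sorted and deduplicated.
extract_keywords() {
  grep -F -- "$1" | grep -oE '\w+(\.\w+)+' | sort -u
}

# "logfile" and the output names are the ones used in the question.
if [ -r logfile ]; then
  extract_keywords 'wrong overlaps'    < logfile > wrong_overlaps
  extract_keywords 'failed to correct' < logfile > no_resp
  extract_keywords 'No data'           < logfile > no_data
fi
```

This reads the log once per message type, which is simple but may be slow for a very large file.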

With GNU grep you can use -P to enable Perl-compatible regular expressions (PCRE), which support lookbehinds:

% grep -oP '(?<=wrong overlaps: )\w+(\.\w+)+' data
XW.W12.HHZ.2007.289

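But note that PCRE doesn't support variable-length lookbehinds, so the lookbehind has to spell out the exact text that immediately precedes the keyword. For example, given the line

something test string: ABC.DEF

ABC.DEF can be extracted with

(?<=test string: )\w+(\.\w+)+

but not with

(?<=test string)\w+(\.\w+)+

because the keyword is directly preceded by ": ", not by "test string", and a pattern such as (?<=test string.*) that would skip over the gap is rejected.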
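If spelling out the exact prefix is inconvenient, PCRE's \K may be an alternative: it resets the start of the reported match, so whatever precedes it must match but is not printed, and that part can be of variable length. A small sketch, assuming a grep built with -P support; the variable name line is only for illustration:

```shell
line='[2016-06-03T14:05:26] failed to correct YP.CT02.HHZ.2012.334 because No matching response.'

# Everything before \K must match but is left out of the printed result;
# \s+ shows that this prefix does not need a fixed length.
printf '%s\n' "$line" | grep -oP 'failed to correct\s+\K\w+(\.\w+)+'
```

For this sample line the command should print YP.CT02.HHZ.2012.334.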