SpaceTrucker - 5 months ago
PowerShell Question

What regex should be used to match a multiline log message?

I'm writing a batch file that processes a log file of my application.

The log file may contain messages whose start matches the regex

^.{24}\[ERROR

followed by some consecutive lines that I need to find. The end of a log message is denoted by the next match of the regex

^.{24}\[[A-Z]

Currently I'm using the regex

(?m)^.{24}\[ERROR(.*\r?\n?.)*?^.{24}\[[A-Z]

to find such messages. But performance is very poor: it already runs for several minutes on a log file of only a few MB.

The complete batch file I'm using is:

@Echo off

powershell -Command "& {[System.Text.RegularExpressions.RegEx]::Matches([System.IO.File]::ReadAllText('application.log'), '(?m)^.{24}\[ERROR(.*\r?\n?.)*?^.{24}\[[A-Z]') | Set-Content result.txt}"

What regex should I use to match the log messages as described above?


The problem is that your regex contains a (.*\r?\n?.)*? section: a quantified group whose nested subpatterns are optional (that is, they can match empty text). Once such subpatterns are quantified inside a group, the regex engine has to try a huge number of combinations before admitting there is no match, which leads to catastrophic backtracking or timeout issues.

One solution is simply to use a lazy dot matching pattern with the DOTALL modifier (?s), so that . also matches line breaks.
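A minimal sketch of such a lazy, DOTALL-based pattern, assuming the start and end markers from the question. The sample log text and the variable names are hypothetical, and the use of [^\r\n]{24} instead of .{24} is my own assumption: with (?s) enabled the dot also matches line breaks, so .{24} could span more than one line.

```powershell
# Lazy dot matching with the DOTALL (s) and multiline (m) inline modifiers.
# Assumption: [^\r\n]{24} keeps the 24-character prefix on a single line,
# because under (?s) a plain dot would also match line breaks.
$lazyPattern = '(?sm)^[^\r\n]{24}\[ERROR.*?^[^\r\n]{24}\[[A-Z]'

# Hypothetical sample; real input would come from
# [System.IO.File]::ReadAllText('application.log').
$sampleLog = @(
    '2016-05-10 12:00:00,123 [INFO] started'
    '2016-05-10 12:00:01,456 [ERROR] Something failed'
    '   at Foo.Bar()'
    '   at Foo.Baz()'
    '2016-05-10 12:00:02,789 [INFO] recovered'
) -join "`n"

# Each match runs from the [ERROR header up to and including the
# 24-character prefix, "[" and first letter of the next log header.
$lazyMatches = [regex]::Matches($sampleLog, $lazyPattern)
$lazyMatches | ForEach-Object { $_.Value }
```

As in the pattern from the question, the match consumes the beginning of the following header, so an [ERROR message that immediately follows another one would not be matched separately.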



The .NET regex engine handles this lazy dot subpattern much better than PCRE, Python re, or JavaScript.

However, lazy matching costs performance, and it is best practice to unroll the lazy pattern.
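One possible unrolled form, sketched under the assumption that the start and end markers are the ones from the question. The sample text and variable names are hypothetical; the sample uses CRLF line endings to show that \r?\n copes with both conventions.

```powershell
# Unrolled pattern (a sketch, not necessarily the only way to unroll it):
#   ^.{24}\[ERROR.*               the ERROR header and the rest of its line
#   (?:\r?\n(?!.{24}\[[A-Z]).*)*  any following line that is NOT a log header
#   \r?\n.{24}\[[A-Z]             the line break and start of the next header
$unrolledPattern = '(?m)^.{24}\[ERROR.*(?:\r?\n(?!.{24}\[[A-Z]).*)*\r?\n.{24}\[[A-Z]'

# Hypothetical sample with Windows (CRLF) line endings.
$crlfLog = @(
    '2016-05-10 12:00:00,123 [INFO] started'
    '2016-05-10 12:00:01,456 [ERROR] Something failed'
    '   at Foo.Bar()'
    '   at Foo.Baz()'
    '2016-05-10 12:00:02,789 [INFO] recovered'
) -join "`r`n"

$unrolledMatches = [regex]::Matches($crlfLog, $unrolledPattern)
$unrolledMatches | ForEach-Object { $_.Value }
```

No DOTALL modifier is needed here: every .* stays within one line, and line breaks are only consumed explicitly by \r?\n.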



Note that these two patterns are equivalent in what they match, but differ in how they match. The lazy version tries to match the trailing part of the pattern at each position and expands by one character on every failure, whereas the unrolled pattern grabs whole stretches of text up to a newline, plus every newline that is not followed by 24 non-newline characters, [ and an uppercase ASCII letter, which is faster.
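The equivalence and the speed difference can be checked with a rough harness like the one below. It is only a sketch: both patterns are my reconstruction from the markers in the question, the synthetic log is made up, and the timings will vary with the machine, the .NET version and the shape of the real log, so no figures are claimed here.

```powershell
# Assumed patterns: a lazy DOTALL version and an unrolled version.
$lazyPattern     = '(?sm)^[^\r\n]{24}\[ERROR.*?^[^\r\n]{24}\[[A-Z]'
$unrolledPattern = '(?m)^.{24}\[ERROR.*(?:\r?\n(?!.{24}\[[A-Z]).*)*\r?\n.{24}\[[A-Z]'

# Hypothetical synthetic log: one five-line block repeated 2000 times.
$block = @(
    '2016-05-10 12:00:00,123 [INFO] started'
    '2016-05-10 12:00:01,456 [ERROR] Something failed'
    '   at Foo.Bar()'
    '   at Foo.Baz()'
    '2016-05-10 12:00:02,789 [INFO] recovered'
) -join "`n"
$syntheticLog = ($block + "`n") * 2000

# Collect the matched text of both patterns and compare them.
$lazyValues     = @([regex]::Matches($syntheticLog, $lazyPattern)     | ForEach-Object { $_.Value })
$unrolledValues = @([regex]::Matches($syntheticLog, $unrolledPattern) | ForEach-Object { $_.Value })
$sameMatches = ($lazyValues.Count -eq $unrolledValues.Count) -and
               (($lazyValues -join '|') -ceq ($unrolledValues -join '|'))
"Same matches: $sameMatches ($($lazyValues.Count) messages)"

# MatchCollection is evaluated lazily, so .Count forces the full scan.
$lazyTime     = Measure-Command { [regex]::Matches($syntheticLog, $lazyPattern).Count }
$unrolledTime = Measure-Command { [regex]::Matches($syntheticLog, $unrolledPattern).Count }
'Lazy:     {0:N1} ms' -f $lazyTime.TotalMilliseconds
'Unrolled: {0:N1} ms' -f $unrolledTime.TotalMilliseconds
```

For a meaningful comparison, run the timing part several times against a copy of the real application.log rather than against this synthetic input.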
