sashoalm - 2 months ago
C++ Question

QRegExp match lines containing N words all at once, but regardless of order (i.e. Logical AND)

I have a file containing many lines of text, and I want to match only those lines that contain every word in a given set. All of the words must be present in the line, but they can appear in any order.

So if we want to match one, two, three, the first 2 lines below would be matched:

three one four two <-- match
four two one three <-- match
one two four five
three three three


Can this be done using QRegExp (without splitting the text and testing each line separately for each word)?

Answer

Yes, it is possible. Use a lookahead. A lookahead checks the text that follows the current position without actually consuming it. Once the lookahead has finished, the regex engine jumps back to where it started, so you can run another lookahead from the same position (in this case, every lookahead starts at the beginning of the line). Try this:

^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$

The negated character classes [^\r\n] make sure that we can never look past the end of the line. Because the lookaheads don't actually consume anything for the match, we add the [^\r\n]* at the end (after the lookaheads) and $ for the end of the line. In fact, you could leave out the $, due to greediness of *, but I think it makes the meaning of the expression a bit more apparent.
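As a rough illustration of how the three lookaheads behave, here is a minimal sketch that applies the same pattern to one line at a time. It uses the standard library's std::regex (ECMAScript grammar, which supports lookaheads) rather than QRegExp, purely so that it compiles without Qt; the function name containsAllWords is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Qt or the original answer).
// Returns true if `line` (a single line without line breaks) contains
// "one", "two" and "three" somewhere, in any order.
bool containsAllWords(const std::string& line) {
    // Same pattern as above. Every lookahead starts at the beginning of
    // the line and consumes nothing; the trailing [^\r\n]* consumes the line.
    static const std::regex pattern(
        R"re(^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$)re");
    return std::regex_match(line, pattern);
}
```

Called on the four sample lines from the question, this should accept the first two and reject the last two. Note that the words are matched as plain substrings, so a line containing "someone" would also satisfy the lookahead for "one"; wrapping each word in \b is one way to tighten that if it matters.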

Make sure to use this regex with multi-line mode (so that ^ and $ match the beginning of a line).

EDIT:

Sorry, QRegExp apparently does not support multi-line mode (the m flag). Its documentation says:

QRegExp does not have an equivalent to Perl's /m option, but this can be emulated in various ways for example by splitting the input into lines or by looping with a regexp that searches for newlines.

It even recommends splitting the string into lines, which is what you want to avoid.

Since QRegExp also does not support lookbehinds (which would help to emulate m), other solutions are a bit trickier. You could go with

(?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*)

The match itself may begin with the line-break character that precedes the line, so the line you want is in capturing group 1. Still, I think splitting the string into lines might make for more readable code than this.
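To show how group 1 would be collected when scanning the whole text in one pass, here is a minimal sketch. Again it uses std::regex instead of QRegExp so that it is self-contained, and the function name matchingLines is hypothetical. With QRegExp the same loop would presumably be written with indexIn() and cap(1), advancing the offset by matchedLength() after each hit.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt or the original answer).
// Scans the whole text without splitting it and returns every line that
// contains "one", "two" and "three" in any order.
std::vector<std::string> matchingLines(const std::string& text) {
    // (?:^|\r|\n) anchors each attempt at the start of the text or just
    // after a line-break character; group 1 captures the line itself.
    static const std::regex pattern(
        R"re((?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*))re");

    std::vector<std::string> lines;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        lines.push_back((*it)[1].str());  // group 1 excludes the leading line break
    }
    return lines;
}
```

Because the leading \r or \n is consumed by the non-capturing group, the full match cannot be used directly; only group 1 holds the clean line.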
