user1769925 - 7 months ago
Perl Question

Perl regex: match a sequence, but make sure the matched text does not contain a given string

I have files with sequences of conversations where speakers are tagged. The format of my files is:

<SPEAKER>John</SPEAKER>
I am John
<SPEAKER>Lisa</SPEAKER>
And I am Lisa


I now want to find the first sequence in each document in which John speaks and Lisa speaks right afterwards. I then want to keep the entire part of the document from that sequence onward, including the sequence itself.

I built this regex:

^.*?(<SPEAKER>John<\/SPEAKER>.*?<SPEAKER>Lisa<\/SPEAKER>.*)


but of course it also matches when the sequence of speakers is John-Michael-Lisa, i.e. when someone else speaks between John and Lisa.

How can I get the right match?

Answer

Here is a regex you can use to match what you describe (apply it with the /s modifier so that . also matches newlines):

(<SPEAKER>John<\/SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa<\/SPEAKER>.*)

And a small demo showing that it works: https://regex101.com/r/iW8vS5/1
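To make the pattern concrete, here is a minimal Perl sketch of how it might be applied to a whole document held in one string. The helper name `tail_from_john_lisa` and the sample text are made up for illustration; the only addition to the regex itself is the `/s` modifier, which the multi-line input needs.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns the text from
# the first John-then-Lisa exchange to the end of the document, or undef.
sub tail_from_john_lisa {
    my ($text) = @_;
    # /s lets "." match newlines, since each turn spans several lines.
    # (?:(?!<SPEAKER>).)* consumes one character at a time, but only while
    # no new <SPEAKER> tag starts at the current position.
    if ($text =~ m{(<SPEAKER>John</SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa</SPEAKER>.*)}s) {
        return $1;
    }
    return;
}

# Made-up sample: the first John turn is followed by Michael, so it is skipped.
my $doc = join "\n",
    '<SPEAKER>John</SPEAKER>',    'Hello',
    '<SPEAKER>Michael</SPEAKER>', 'Hi John',
    '<SPEAKER>Lisa</SPEAKER>',    'Hi all',
    '<SPEAKER>John</SPEAKER>',    'I am John',
    '<SPEAKER>Lisa</SPEAKER>',    'And I am Lisa';

my $tail = tail_from_john_lisa($doc);
print defined $tail ? $tail : 'no match', "\n";
```

Using `m{...}` as the delimiter avoids having to escape the slashes in `</SPEAKER>`.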

However, as both kchinger and owler mentioned, regex probably isn't the best way to do this. A regex solution would likely be significantly slower than a small snippet of code for any long document.
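The answer does not spell out what the non-regex snippet would look like. One possible shape, as a rough sketch only, is a line-by-line scan that remembers where the most recent John tag was. It assumes every `<SPEAKER>` tag starts its own line, as in the sample above, and the helper name `tail_by_scanning` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: scan line by line and return everything from the
# first John turn that is immediately followed by a Lisa turn, or undef.
# Assumes each <SPEAKER> tag sits at the start of its own line.
sub tail_by_scanning {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my $john_at;    # index of the most recent John tag, if still a candidate
    for my $i (0 .. $#lines) {
        # Only speaker tags change the state; spoken lines are skipped.
        next unless $lines[$i] =~ m{^<SPEAKER>(.*?)</SPEAKER>};
        my $speaker = $1;
        if (defined $john_at && $speaker eq 'Lisa') {
            return join "\n", @lines[$john_at .. $#lines];
        }
        # John starts a new candidate; any other speaker discards it.
        $john_at = $speaker eq 'John' ? $i : undef;
    }
    return;
}
```

Each line is examined once, so the work grows linearly with document length and there is no backtracking to worry about.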