Azaghal - 1 month ago
Perl Question

Regex matching specific value after a certain number of tabs

In a tab delimited text file, I would like to match only lines containing the "1" value right after the 24th tab.

Right now, the regex I have seems to match what I want, but breaks when the line doesn't match.

Could you help me improve it?

My regex:



/(?:.+?\t){24}1/


Sample input:



INT E_63 0 0 u Le Le DET:ART DET le ?? ADJ SENT DET:ART NOM ADV SENT DET NOM 1 ?? ?? ?? ?? ?? 0 0 0 0 0 1 ?? ?? ?? ?? ?? ??
INT E_63 0 0 u Le Le DET:ART DET le ?? ADJ SENT DET:ART NOM ADV SENT DET NOM 1 ?? ?? ?? ?? ?? 0 0 0 0 0 0 ?? ?? ?? ?? ?? ??


(The first line should match, the second should not.)

Answer

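Your regex breaks on non-matching lines because of catastrophic backtracking: . also matches a tab character, so .+? can extend across field boundaries, and the engine tries a huge number of ways to split the line into 24 groups before it gives up. The 1 that follows the group with nested quantifiers forces the engine to backtrack into that group whenever it fails, and the missing ^ anchor makes it retry from every position in the line, so catastrophic backtracking is inevitable.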

What you need is a negated character class, [^\t], and an anchor at the start of the string:

/^(?:[^\t]*\t){24}1/

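As a minimal sketch of how the anchored pattern could be used to filter lines: the rows below are hypothetical (24 filler fields built with join so the tabs are explicit), not your real data, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical rows: 24 filler fields, then the value under test in field 25.
my @filler = ('x') x 24;
my @lines  = (
    join("\t", @filler, '1', 'rest'),   # field 25 is "1"
    join("\t", @filler, '0', 'rest'),   # field 25 is "0"
);

# Anchored pattern with a negated character class, as suggested above.
my $re = qr/^(?:[^\t]*\t){24}1/;

my @matched = grep { $_ =~ $re } @lines;
print scalar(@matched), " of ", scalar(@lines), " lines matched\n";
# prints "1 of 2 lines matched"
```

With a real file, the same grep condition would typically go inside a while (my $line = <$fh>) loop instead of running over an in-memory list.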

NOTE: To match the 1 as a whole word, you might consider adding \b after it, or a lookahead (?!\S).
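To illustrate the difference between the three variants, here is a small sketch that tries each of them against made-up values in the 25th field. The test values (1, 10, 1.5) and the hash keys are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $prefix = "x\t" x 24;   # 24 hypothetical fields, each followed by a tab

my %re = (
    plain     => qr/^(?:[^\t]*\t){24}1/,        # 1 followed by anything
    boundary  => qr/^(?:[^\t]*\t){24}1\b/,      # 1 not followed by a word char
    lookahead => qr/^(?:[^\t]*\t){24}1(?!\S)/,  # 1 followed by whitespace or end
);

my %hits;
for my $value ('1', '10', '1.5') {
    my $line = $prefix . $value . "\tend";
    # Collect the names of the variants that match this line.
    $hits{$value} = join ' ', grep { $line =~ $re{$_} } sort keys %re;
    print "$value: $hits{$value}\n";
}
```

The plain pattern accepts all three values, \b rejects 10 but still accepts 1.5 (since . is not a word character), and (?!\S) accepts only the bare 1.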

Details:

  • ^ - start of a string
  • (?:[^\t]*\t){24} - 24 sequences of
    • [^\t]* - 0+ chars other than a tab char
    • \t - a tab char
  • 1 - a 1 char.
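The breakdown above can also be written into the pattern itself with the /x modifier, which lets you add whitespace and comments. This is just one possible way to lay it out, and the test line is a hypothetical row.

```perl
use strict;
use warnings;

my $re = qr{
    ^               # start of the string
    (?:             # one field plus its delimiter:
        [^\t]*      #   zero or more chars other than a tab
        \t          #   the tab that ends the field
    ){24}           # exactly 24 such sequences
    1               # a literal 1 right after the 24th tab
}x;

my $line = ("a\t" x 24) . "1\tb";    # hypothetical row with 1 in field 25
print $line =~ $re ? "match\n" : "no match\n";
# prints "match"
```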