Shlomi Schwartz - 10 days ago
Node.js Question

Inverse match for HTML tags

Using NodeJS, I have the following regex:

which matches HTML tags:
(Live Demo)

I would like to invert the match so that it captures the text between the tags instead. I've tried a negative lookahead approach, with no luck.

I'm avoiding the split method because I need the indexes of the matches.

Is it possible with JS?


Is it possible with JS?

No. HTML can be arbitrarily nested, which means you need recursion in order to consume it using regex - something which JavaScript regex doesn't have.
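To see the missing feature concretely, here is a minimal sketch that asks the JavaScript engine to compile a pattern containing the PCRE recursion construct (?R). The function name supportsRecursion and the test pattern are made up for illustration. On current Node.js releases the constructor is expected to throw a SyntaxError, so the check reports false.

```javascript
// Hypothetical helper: reports whether the engine accepts PCRE-style recursion.
function supportsRecursion() {
  try {
    // (?R) is PCRE syntax for "recurse into the whole pattern".
    new RegExp("a(?R)?b");
    return true;
  } catch (err) {
    // JavaScript rejects the unknown group syntax at compile time.
    return false;
  }
}

console.log(supportsRecursion());
```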

Assuming you can ditch JS and use a language that supports PCRE, this monstrous bunch of unintelligible characters written by Cthulhu regex does the trick (mandatory regex101 link) (note that it doesn't deal with CDATA and can't know whether or not the tags are balanced):


Here's how it works:

  • <!--[\s\S]*?-->| is for preventing comments from causing false positives
  • <([a-z]+)(?:\s\S+?=(["']|)[\s\S]*?\2)*> is the opening tag, where
    • ([a-z]+) is the tag name (note the capturing group - we'll need it in the closing tag)
    • (?:\s\S+?=(["']|)[\s\S]*?\2)* is the attributes, where
      • \s is the whitespace character that separates attributes from tag name and each other
      • \S+?= is the attribute name followed by an equals sign (note the lazy quantifier - we need it because \S includes =)
      • (["']|)[\s\S]*?\2 is the value, which can be enclosed in double quotes, single quotes, or nothing
  • ((?:[\s\S]*?(?R)?)*) is the text between tags (note the capturing group - it's exactly what you need and will appear as group 3), where (?R)? makes the regex able to deal with nested constructs
  • <\/\1> is the closing tag, where \1 is the tag name (remember the capturing group in the opening tag)
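If staying in Node.js is a requirement, one workaround that avoids recursion is to match the tags themselves and record the gaps between consecutive matches, which also gives the indexes the question asks for. The following is only a sketch under stated assumptions: the TAG pattern is a simplified stand-in (not the regex from the question), textSegments is a hypothetical helper name, and, like the PCRE version, it ignores CDATA and does not check that tags are balanced.

```javascript
// Assumed, simplified tag pattern: an HTML comment, or an opening/closing tag
// whose attribute values may be wrapped in single or double quotes.
const TAG = /<!--[\s\S]*?-->|<\/?[a-zA-Z][^\s>\/]*(?:"[^"]*"|'[^']*'|[^'">])*>/g;

// Collects every run of text that sits between tag matches, with its offsets.
function textSegments(html) {
  const segments = [];
  let last = 0; // index just past the previous tag

  for (const m of html.matchAll(TAG)) {
    if (m.index > last) {
      segments.push({ text: html.slice(last, m.index), start: last, end: m.index });
    }
    last = m.index + m[0].length;
  }

  // Trailing text after the final tag, if any.
  if (last < html.length) {
    segments.push({ text: html.slice(last), start: last, end: html.length });
  }
  return segments;
}

const sample = '<p class="a">Hello <b>world</b>!</p>';
console.log(textSegments(sample));
// [ { text: 'Hello ', start: 13, end: 19 },
//   { text: 'world', start: 22, end: 27 },
//   { text: '!', start: 31, end: 32 } ]
```

Each segment carries start and end offsets into the original string, so nothing is lost compared to a real inverse match. For anything beyond simple input, a proper HTML parser is the safer choice.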