José M. Carnero - 5 months ago
HTML Question

Regex pattern to match hashtag, but not in HTML attributes

I'm trying to extract hashtags from HTML text with the regular expression

#([a-z0-9_]+)

but it also matches fragments inside HTML attributes.

For example, in the HTML text:

hola que tal with #hash1.
hola que tal with #hash2

y <a href="hola.que.tal#hash3"> para #hash4. </a>


I want to recover "hash1", "hash2" and "hash4" but not "hash3".

I tried to solve it with lookarounds, using the following expression:

(?<!<)#([a-z0-9_]+)(?!.*?>)


but without success.

How can I do it with a single regular expression?

Answer

This should work:

/#[a-z0-9_]+(?![^<]*>)/

See http://www.regexpal.com/?fam=95144

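The negative lookahead rejects a hashtag when the next > after it is reached without passing a <, which is what happens when the hashtag sits inside a tag. In other words, it makes sure that either there is a < between the hashtag and the next >, or there is no > after it at all.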
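As a quick check, here is a minimal Python sketch that runs the pattern over the sample text from the question. It assumes Python's `re` module and adds the capture group from the question's own pattern so that only the name after the # is returned; the names `HASHTAG`, `html` and `tags` are just illustrative.

```python
import re

# Hashtag whose following text does not reach a '>' before any '<',
# i.e. a hashtag that is not inside a tag. The capture group returns
# the name without the leading '#'.
HASHTAG = re.compile(r"#([a-z0-9_]+)(?![^<]*>)")

# Sample text from the question.
html = (
    "hola que tal with #hash1.\n"
    "hola que tal with #hash2\n"
    "\n"
    'y <a href="hola.que.tal#hash3"> para #hash4. </a>\n'
)

tags = HASHTAG.findall(html)
print(tags)  # ['hash1', 'hash2', 'hash4']
```

Note one limitation of this approach: a literal > in ordinary text (for example "#a > b" with no < in between) also causes the hashtag before it to be skipped, so it relies on > being escaped as &gt; outside of tags.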