Duck Duck - 6 months ago 46
HTML Question

Java regex pattern matching to multiple of same tags

Question:

How do I successfully match

<tag TAG1>SOME VALUE</tag TAG1><tag TAG1>ANOTHER VALUE</tag TAG1>
as 2 separate values?

Background:

I am attempting to match a string like this:
<tag TAG1>SOME VALUE</tag TAG1><tag TAG1>ANOTHER VALUE</tag TAG1>

Where TAG1 is the name of that specific tag (multiple tags can have the same name but different values) and SOME VALUE, ANOTHER VALUE are different values enclosed by the tags.

So far I am able to match a single pair of tags such as
<tag TAG1>SOME VALUE</tag TAG1>
using the regex pattern
<\\s*tag\\s*.+\\s*>(.*)</\\s*tag\\s*.+\\s*>


The example above is a worst-case scenario with no characters separating the end of the first tag and the start of the second. My problem is that when I run find() with my regex string, I get both tags back as if they were one tag.

The problem is with the wildcard in between the tags, (.*), because it doesn't exclude the end/start of a tag. I need the wildcard match because any character (including \n) could be inside the tags. I am also using Pattern.DOTALL to successfully match 1 tag with newlines inside.

Answer

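Make the wildcard reluctant, (.*?), so each match ends at the first closing tag instead of the last, and match the tag name with [^>]+ so it cannot run past the end of the opening tag. A negated character class is not a substitute for the wildcard here, because it excludes individual characters rather than the sequence </. Keep Pattern.DOTALL so that . still matches newlines:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String value = "<tag TAG1>SOME VALUE</tag TAG1><tag TAG1>ANOTHER VALUE</tag TAG1>";
// (.*?) is reluctant: it stops at the first closing tag; DOTALL lets '.' match '\n'.
Pattern pattern = Pattern.compile("<\\s*tag\\s*[^>]+>(.*?)</\\s*tag\\s*[^>]+>", Pattern.DOTALL);
Matcher matcher = pattern.matcher(value);
while (matcher.find()) {
    System.out.println(matcher.group()); // whole match, tags included
}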

Output:

<tag TAG1>SOME VALUE</tag TAG1>
<tag TAG1>ANOTHER VALUE</tag TAG1>
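
If only the enclosed values are needed rather than the whole tags, the capturing group can be read with matcher.group(1). The following is a minimal sketch, assuming a reluctant wildcard (.*?) combined with Pattern.DOTALL; the class name TagValueExtractor and the method extractValues are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
class TagValueExtractor {
    // Reluctant (.*?) ends each match at the first closing tag;
    // DOTALL lets '.' match newlines inside a value.
    private static final Pattern TAG = Pattern.compile(
            "<\\s*tag\\s*[^>]+>(.*?)</\\s*tag\\s*[^>]+>", Pattern.DOTALL);

    // Returns the text enclosed by each tag pair, in order of appearance.
    static List<String> extractValues(String input) {
        List<String> values = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        while (matcher.find()) {
            values.add(matcher.group(1)); // group(1) is the enclosed value only
        }
        return values;
    }

    public static void main(String[] args) {
        String value = "<tag TAG1>SOME VALUE</tag TAG1><tag TAG1>ANOTHER VALUE</tag TAG1>";
        for (String v : extractValues(value)) {
            System.out.println(v);
        }
    }
}
```

Run against the string from the question, this prints SOME VALUE and ANOTHER VALUE on separate lines. Like the pattern above, it does not check that the closing tag's name equals the opening tag's name, and regular expressions in general cope poorly with nested tags.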