arabian_albert - 7 months ago
Java Question

Regex for extracting text between Tags but not the tags

I have the following text:

<Data>
<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>
<Gist>AllConditionsTrue</Gist>
<Template>
<Text id="1">Your spouse is required to have a Social Security number instead of an ITIN to claim this credit. This is based on the IRS rules for claiming the Earned Income Credit.</Text>
</Template>
</Data>
<Data>
<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>
<Gist>AllConditionsTrue</Gist>
<Template>
<Text id="1">Your spouse has the required Social Security number instead of an ITIN to claim this credit. This is based on the IRS rules for claiming the Earned Income Credit.</Text>
</Template>
</Data>


I would like to extract the data between the xpath tags but not the tags themselves.

Output should be:

/Temporary/EIC/SpouseSSNDisqualification


/Temporary/EIC/SpouseSSNDisqualification


This regex seems to give me all the text, including the 'xpath' tags, which I don't want:

<xpath>(.+?)<\/xpath>


Edit:

Here is my Java code, though I am not sure whether it adds value to my question:

try {
    String xml = FileUtils.readFileToString(file);
    Pattern p = Pattern.compile("<xpath>(.+?)<\\/xpath>");
    Matcher m = p.matcher(xml);

    while (m.find()) {
        System.out.println(m.group(0));
    }
} catch (IOException e) {
    e.printStackTrace();
}

Answer

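Easy. It's because you print group(0), which is the entire match with the tags included. The text between the tags is captured by the parentheses and is available as group(1).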

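// Needs java.util.regex.Matcher and java.util.regex.Pattern.
String nodestr = "<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>";
String regex = "<xpath>(.+?)</xpath>"; // '/' needs no escaping; "\/" is not a valid Java string escape
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(nodestr);
// matches() succeeds only if the whole input is one element; use find() in a loop to scan a larger document
if (matcher.matches()) {
    String tag_value = matcher.group(1); // group 1 is the text between the tags
    System.out.println(tag_value); // prints /Temporary/EIC/SpouseSSNDisqualification
}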
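
For the full text in the question, which contains several Data blocks, the same idea can be combined with the find() loop from the question. The following is only a minimal sketch: the class name XpathExtractor and the method extractXpaths are made up for illustration, and it assumes each xpath value sits on a single line (by default '.' does not match line breaks).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original post.
final class XpathExtractor {

    // Non-greedy capture of whatever sits between <xpath> and </xpath>.
    // A forward slash needs no escaping in a Java regex.
    private static final Pattern XPATH_TAG = Pattern.compile("<xpath>(.+?)</xpath>");

    // Returns the text of every <xpath> element, without the tags.
    static List<String> extractXpaths(String xml) {
        List<String> values = new ArrayList<>();
        Matcher m = XPATH_TAG.matcher(xml);
        while (m.find()) {          // find() steps through successive matches
            values.add(m.group(1)); // group(1) is the captured text only
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened version of the sample data from the question.
        String xml =
              "<Data>\n"
            + "<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>\n"
            + "<Gist>AllConditionsTrue</Gist>\n"
            + "</Data>\n"
            + "<Data>\n"
            + "<xpath>/Temporary/EIC/SpouseSSNDisqualification</xpath>\n"
            + "<Gist>AllConditionsTrue</Gist>\n"
            + "</Data>\n";

        for (String value : extractXpaths(xml)) {
            System.out.println(value);
        }
    }
}
```

Running main prints /Temporary/EIC/SpouseSSNDisqualification twice, once per Data block. If the input is real XML of any complexity, a proper XML parser is usually a safer choice than a regex.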