Jose Berciano - 1 month ago
Java Question

Remove empty tags from XML using Java

I'm adding some functionality to a servlet. When it receives the InputStream (which is basically a PDF document converted into XML), I want to read that data into a String object and then delete all the empty tags, but I haven't had any good results so far.

This is the data the servlet is receiving:



<form1>
<GenInfo>
<Section1>
<EmployeeDet>
<Title>999990000</Title>
<Firstname>MIKE</Firstname>
<Surname>SPENCER</Surname>
<CoName/>
<EmpAdd>
<Address><Add1/><Add2/><Town/><County/><Pcode/></Address>
</EmpAdd>
<PosHeld>DEVELOPER</PosHeld>
<Email/>
<ConNo/>
<Nationality/>
<PPSNo/>
<EmpNo/>
</EmployeeDet>
</Section1>
</GenInfo>
</form1>


The final result should look like this:



<form1>
<GenInfo>
<Section1>
<EmployeeDet>
<Title>999990000</Title>
<Firstname>MIKE</Firstname>
<Surname>SPENCER</Surname>
<PosHeld>DEVELOPER</PosHeld>
</EmployeeDet>
</Section1>
</GenInfo>
</form1>


My apologies if this is a repeated question. I researched similar posts, but none of them gave me a working approach, which is why I am asking in a separate post.

Thank you in advance.

Answer Source

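Here's a regex way of doing what you want. There are edge cases it doesn't cover: tags written with attributes or with a space before the slash (<Tag />), names containing characters outside \w such as hyphens or namespace prefixes, and chains of empty elements nested more than three levels deep, since each pattern runs just once. A DOM parser is probably the best way to do this.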

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemoveEmptyTags {
    public static void main(String[] args) throws Exception {
        // The patterns run in order, so each pass can remove elements
        // that the previous pass left empty.
        String[] patterns = new String[] {
            // This will remove empty elements that look like <ElementName/>
            "\\s*<\\w+/>",
            // This will remove empty elements that look like <ElementName></ElementName>
            "\\s*<\\w+></\\w+>",
            // This will remove empty elements that look like
            // <ElementName>
            // </ElementName>
            "\\s*<\\w+>\\s*</\\w+>"
        };

        String xml = "    <form1>\n" +
                     "        <GenInfo>\n" +
                     "            <Section1>\n" +
                     "                <EmployeeDet>\n" +
                     "                    <Title>999990000</Title>\n" +
                     "                    <Firstname>MIKE</Firstname>\n" +
                     "                    <Surname>SPENCER</Surname>\n" +
                     "                    <CoName/>\n" +
                     "                    <EmpAdd>\n" +
                     "                        <Address><Add1/><Add2/><Town/><County/><Pcode/></Address>\n" +
                     "                    </EmpAdd>\n" +
                     "                    <PosHeld>DEVELOPER</PosHeld>\n" +
                     "                    <Email/>\n" +
                     "                    <ConNo/>\n" +
                     "                    <Nationality/>\n" +
                     "                    <PPSNo/>\n" +
                     "                    <EmpNo/>\n" +
                     "                </EmployeeDet>\n" +
                     "            </Section1>\n" +
                     "        </GenInfo>\n" +
                     "    </form1>";

        // Strip every match of each pattern from the document text.
        for (String pattern : patterns) {
            Matcher matcher = Pattern.compile(pattern).matcher(xml);
            xml = matcher.replaceAll("");
        }

        System.out.println(xml);
    }
}

Results:

    <form1>
        <GenInfo>
            <Section1>
                <EmployeeDet>
                    <Title>999990000</Title>
                    <Firstname>MIKE</Firstname>
                    <Surname>SPENCER</Surname>
                    <PosHeld>DEVELOPER</PosHeld>
                </EmployeeDet>
            </Section1>
        </GenInfo>
    </form1>
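
For comparison, the DOM-parser route mentioned at the top of this answer could look roughly like the sketch below. It uses only the JDK's built-in javax.xml and org.w3c.dom APIs. The class and method names (EmptyElementPruner, prune, removeEmptyElements) are made up for illustration, and it assumes an element counts as "empty" when it has no attributes and nothing inside it except whitespace. Elements that hold only a comment or CDATA section are kept.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class EmptyElementPruner {

    // Walks the tree bottom-up. Whitespace-only text nodes are dropped, and a
    // child element is removed once it has no attributes and no children left.
    // Assumption: that is what "empty" means for this document.
    static void prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // grab before a possible removal
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                prune(child); // children first, so parents can become empty
                if (!child.hasChildNodes() && !child.hasAttributes()) {
                    node.removeChild(child);
                }
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            }
            child = next;
        }
    }

    // Parses the XML text, prunes empty elements below the root, and
    // serializes the result without an XML declaration.
    static String removeEmptyElements(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // The root element itself is never removed, only its descendants.
        prune(doc.getDocumentElement());

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Calling EmptyElementPruner.removeEmptyElements(xml) on the document above should drop CoName, the whole EmpAdd subtree, Email, ConNo, Nationality, PPSNo and EmpNo, however deeply the empty elements are nested. Because the indentation text nodes are removed as well, the output comes back unindented; setting OutputKeys.INDENT to "yes" on the Transformer asks for pretty-printing, though the exact layout depends on the JDK. In a servlet, DocumentBuilder.parse also accepts the InputStream directly, so the intermediate String is optional.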