Amit Kumar - 2 months ago
Java Question

How to use split or another function to extract information from a text file containing XML tags

Hi, I have flat text files with data in a form like this:

<PersonName> Ian </PersonName> <OrgName> Cum Sociis Natoque Limited</OrgName>
<PersonName> Camilla </PersonName> <OrgName> Lorem Corporation </OrgName>
<PersonName> Addison </PersonName> <OrgName> Tempus Corp. </OrgName>
<PersonName> Arsenio </PersonName> <OrgName> Id LLP </OrgName>


I want the final outcome like this:

Ian: PersonName
Cum Sociis Natoque Limited: OrgName
Camilla: PersonName
... and so on


Does anyone have any insights?

Answer

Assuming that your file really is a plain text file and not an XML file, you could use a regular expression to extract the tag name and the text content between each pair of XML tags, as follows:

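import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// filePath is assumed to be a String holding the path of the text file;
// the enclosing method must handle or declare IOException.
// Group 1 captures the tag name, group 2 the text between the tags.
Pattern pattern = Pattern.compile("<([^>]+)>([^<]*)</[^>]+>");
try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // Echo the raw line, then print every "text: tag" pair found in it.
        System.out.println(line);
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            System.out.printf("%s: %s ", matcher.group(2).trim(), matcher.group(1));
        }
        System.out.println();
    }
}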

Output:

<PersonName> Ian </PersonName> <OrgName> Cum Sociis Natoque Limited</OrgName>
Ian: PersonName Cum Sociis Natoque Limited: OrgName 
<PersonName> Camilla </PersonName> <OrgName> Lorem Corporation </OrgName>
Camilla: PersonName Lorem Corporation: OrgName 
<PersonName> Addison </PersonName> <OrgName> Tempus Corp. </OrgName>
Addison: PersonName Tempus Corp.: OrgName 
<PersonName> Arsenio </PersonName> <OrgName> Id LLP </OrgName>
Arsenio: PersonName Id LLP: OrgName 

In Java 8, using streams, it would be:

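import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Same pattern as above: group 1 is the tag name, group 2 the enclosed text.
Pattern pattern = Pattern.compile("<([^>]+)>([^<]*)</[^>]+>");
// Files.lines reads the file lazily; try-with-resources closes it afterwards.
try (Stream<String> stream = Files.lines(Paths.get(filePath))) {
    stream.forEach(
        line -> {
            // Echo the raw line, then print every "text: tag" pair found in it.
            System.out.println(line);
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                System.out.printf("%s: %s ", matcher.group(2).trim(), matcher.group(1));
            }
            System.out.println();
        }
    );
}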