Tony Nesavich - 5 months ago
Bash Question

Extract entire elements from large XML to individual files

I am looking to extract elements from a large XML file to individual files, preferably with a command or script.

The issue is that the XML is proprietary and not well-formed, so whenever I try XML utilities like twig or xmlstarlet the data gets munged and special characters get messed up. Hence my need for a simple regex match and a direct copy of exactly what matches to a file, repeated for each match, with file names that iterate: match1.xml, match2.xml, and so on.

Example XML source:

...
  <testcase id="001" kind="bvt">
    <inputs>
      <arg1>4</arg1>
      <arg2>7</arg2>
    </inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id="002" kind="drt">
    <inputs>
      <arg1>9</arg1>
      <arg2>6</arg2>
    </inputs>
    <expected>15.00</expected>
  </testcase>
  <testcase id="003" kind="bvt">
    <inputs>
      <arg1>5</arg1>
      <arg2>8</arg2>
    </inputs>
    <expected>13.00</expected>
  </testcase>
...


Desired output:
Content of match1.xml:

...
  <testcase id="001" kind="bvt">
    <inputs>
      <arg1>4</arg1>
      <arg2>7</arg2>
    </inputs>
    <expected>11.00</expected>
  </testcase>
...


Content of match2.xml:

...
  <testcase id="002" kind="drt">
    <inputs>
      <arg1>9</arg1>
      <arg2>6</arg2>
    </inputs>
    <expected>15.00</expected>
  </testcase>
...


and so on.

Here is some regex I put together that will work. All I need is an assist on putting together a loop in a bash script to copy each match / element to its own file.

(<testcase\b[\s\S]*?<\/testcase>)

Answer

Figured it out! Python has a great regex module, "re", that I used to solve this.

Below is the Python I used. In this case each element is everything (line breaks, carriage returns, line feeds, special characters, etc.) from the opening <object> tag up to and including the closing </object> tag, as needed in this use case.

Each object element is written to its own file, numbered package-0000.xml through package-nnnn.xml, and the content is exactly what was in the original file (no munging issues)! :)

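import re

# Non-greedy match of each <object>...</object> element.
# [\s\S] matches any character, including newlines, so no DOTALL flag is needed.
pattern = re.compile(r'(<object>[\s\S]*?</object>)')

# newline='' (Python 3) turns off newline translation so CR/LF are kept as-is.
with open("/temp/Test/package1.xml", 'r', newline='') as f:
    matches = pattern.findall(f.read())

# Write each match to its own numbered file: package-0000.xml, package-0001.xml, ...
for i, element in enumerate(matches):
    with open("/temp/Test/package-{0:04d}.xml".format(i), 'w', newline='') as nf:
        nf.write(element)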
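
For the testcase example from the question, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions: the function name split_testcases and the paths in the usage line are made up for illustration, the file is small enough to read into memory in one go, and testcase elements are not nested inside each other. It works on raw bytes rather than text, so nothing is decoded or re-encoded and line endings are left untouched.

```python
import re
from pathlib import Path

# Bytes pattern: non-greedy match from "<testcase" through the first "</testcase>".
# \b stops it from matching longer tag names such as <testcases>.
TESTCASE = re.compile(rb'<testcase\b[\s\S]*?</testcase>')


def split_testcases(source, out_dir):
    """Write each <testcase> element found in `source` to out_dir/matchN.xml.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Binary read: no decoding and no newline translation.
    data = Path(source).read_bytes()
    written = []
    for n, m in enumerate(TESTCASE.finditer(data), start=1):
        target = out_dir / "match{}.xml".format(n)
        target.write_bytes(m.group(0))  # exact bytes of the match
        written.append(target)
    return written
```

Calling split_testcases("input.xml", "out") (both names are placeholders) would produce out/match1.xml, out/match2.xml, and so on, numbered from 1 as in the question. Note that each match starts at the "<" of the opening tag, so any indentation in front of it is not copied.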