Apricot Apricot - 11 months ago 75
Python Question

Merge specific child nodes in xml using python

I want to merge certain sub elements of xml file together. The following is the format I have:

<?xml version='1.0' encoding='ISO-8859-1'?><?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?><dataset><name>imglab dataset</name><comment>Created by imglab tool.</comment><images>
<image file='/home/user126043/Documents/testimages/9935.jpg'>
<box top='329' left='510' width='385' height='534'>
<label>Pirelli
</label></box></image>
<image file='/home/user126043/Documents/testimages/9935.jpg'>
<box top='360' left='113' width='440' height='147'>
<label>Pirelli
</label></box></image>
<image file='/home/user126043/Documents/testimages/9921.jpg'>
<box top='329' left='510' width='385' height='534'>
<label>Pirelli
</label></image>
</images></dataset>


In the above xml I have the box coordinates of image 99.jpg specified twice which I want to merge into one. I want to remove the
<image>
tag that appears repeatively for the same image and want to merge all the box coordinates for every single image within its own image tags. I have never worked with XML and hence I am not sure if the definitions that I use is right here or not. The desired output is:

<?xml version='1.0' encoding='ISO-8859-1'?><?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?><dataset><name>imglab dataset</name><comment>Created by imglab tool.</comment><images>
<image file='/home/user126043/Documents/testimages/9935.jpg'>
<box top='329' left='510' width='385' height='534'>
<label>Pirelli
</label></box>
<box top='360' left='113' width='440' height='147'>
<label>Pirelli
</label></box></image>
<image file='/home/user126043/Documents/testimages/9921.jpg'>
<box top='329' left='510' width='385' height='534'>
<label>Pirelli
</label></image>
</images></dataset>

Answer Source

You can try with module xml.etree.ElementTree :

import xml.etree.ElementTree as ET
tree = ET.parse('dataset.xml')
root = tree.getroot()
file_dict = dict()
for image in root.iter('image'):    
    file_str = image.get('file')    
    if file_str in file_dict:
        root.find('images').remove(image) #remove the duplicate one
        root.find('images').find("./image[@file='"+file_str+"']").append(image.find('box')) #append duplicated subelement to merge with same image element
    else:
        file_dict[file_str]=image
print(ET.tostring(root))

The new root will be:

<dataset><images>
<image file="/home/user126043/Documents/testimages/9941.jpg">
<box height="147" left="113" top="360" width="440">
<label>Pirelli
</label></box></image>
<image file="/home/user126043/Documents/testimages/99.jpg">
<box height="276" left="247" top="160" width="228">
<label>Pirelli
</label></box><box height="276" left="247" top="439" width="506">
<label>Pirelli
</label></box></image>
</images></dataset>
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download