Metahuman - 7 months ago
Python Question

Parse updateinfo.xml

I have been trying to parse the Amazon updateinfo.xml file for my university project in Python. An example file is as follows:



<?xml version="1.0" ?>
<updates>
<update author="linux-security@amazon.com" from="linux-security@amazon.com" status="final" type="security" version="1.4">
<id>AL2012-2014-001</id>
<title>Amazon Linux 2012.03 - AL2012-2014-001: important priority package update for libxml2</title>
<issued date="2014-10-19 15:48" />
<updated date="2014-10-19 15:48" />
<severity>important</severity>
<description>Package updates are available for Amazon Linux that fix the following vulnerabilities:
CVE-2012-5134:
A heap-based buffer underflow flaw was found in the way libxml2 decoded certain entities. A remote attacker could provide a specially-crafted XML file that, when opened in an application linked against libxml2, would cause the application to crash or, potentially, execute arbitrary code with the privileges of the user running the application.
</description>
<references>
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5134" id="CVE-2012-5134" title="" type="cve" />
<reference href="https://rhn.redhat.com/errata/RHSA-2012:1512.html" id="RHSA-2012:1512" title="" type="redhat" />
</references>
<pkglist>
<collection short="amazon-linux">
<name>Amazon Linux</name>
<package arch="x86_64" epoch="0" name="libxml2-debuginfo" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-debuginfo-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2-devel" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-devel-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2-static" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-static-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2-python" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-python-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
</collection>
</pkglist>
</update>
<update author="linux-security@amazon.com" from="linux-security@amazon.com" status="final" type="security" version="1.4">
<id>AL2012-2015-088</id>
<title>Amazon Linux 2012.03 - AL2012-2015-088: medium priority package update for gnutls</title>
<issued date="2015-07-29 20:47" />
<updated date="2015-07-29 20:47" />
<severity>medium</severity>
<description>Package updates are available for Amazon Linux that fix the following vulnerabilities:
CVE-2015-0294:
It was discovered that GnuTLS did not check if all sections of X.509 certificates indicate the same signature algorithm. This flaw, in combination with a different flaw, could possibly lead to a bypass of the certificate signature check.

CVE-2015-0282:
It was found that GnuTLS did not verify whether a hashing algorithm listed in a signature matched the hashing algorithm listed in the certificate. An attacker could create a certificate that used a different hashing algorithm than it claimed, possibly causing GnuTLS to use an insecure, disallowed hashing algorithm during certificate verification.

CVE-2014-8155:
It was found that GnuTLS did not check activation and expiration dates of CA certificates. This could cause an application using GnuTLS to incorrectly accept a certificate as valid when its issuing CA is already expired.
</description>
<references>
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-8155" id="CVE-2014-8155" title="" type="cve" />
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0282" id="CVE-2015-0282" title="" type="cve" />
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0294" id="CVE-2015-0294" title="" type="cve" />
<reference href="https://rhn.redhat.com/errata/RHSA-2015:1457.html" id="RHSA-2015:1457" title="" type="redhat" />
</references>
<pkglist>
<collection short="amazon-linux">
<name>Amazon Linux</name>
<package arch="x86_64" epoch="0" name="gnutls-debuginfo" release="18.14.al12" version="2.8.5">
<filename>Packages/gnutls-debuginfo-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls-devel" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-devel-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls-utils" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-utils-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls-guile" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-guile-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-debuginfo" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-debuginfo-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-devel" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-devel-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-guile" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-guile-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-utils" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-utils-2.8.5-18.14.al12.i686.rpm</filename></package>
</collection>
</pkglist>
</update>
</updates>





I am trying to extract details for each package: the arch type, the name, the release version, and the file name without the leading "Packages/".

My question is: how do I do this efficiently for a file with some 300 of the above entries? With my limited knowledge of Python, I can manage to get this out of a single entry. But with so many entries (700+, 1.5G file size), running it in a for loop consumes a lot of resources and the output is garbled. How do I do this?

Answer

Use the xml.etree.ElementTree module from the standard library. In my experience, its performance is good.

For example:

import xml.etree.ElementTree as ET

# Parse the whole file into memory and get the <updates> root element.
tree = ET.parse('updateinfo.xml')
root = tree.getroot()

# Each <update> lists its packages under pkglist/collection.
for update in root.findall('update'):
    for package in update.findall('pkglist/collection/package'):
        # Strip the leading 'Packages/' directory from the file name.
        filename = package.find('filename').text.replace('Packages/', '')
        print(package.attrib['arch'], package.attrib['name'], package.attrib['release'], filename)

This produces the following output (run with Python 3):

x86_64 libxml2-debuginfo 10.23.26.ec2 libxml2-debuginfo-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2-devel 10.23.26.ec2 libxml2-devel-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2 10.23.26.ec2 libxml2-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2-static 10.23.26.ec2 libxml2-static-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2-python 10.23.26.ec2 libxml2-python-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 gnutls-debuginfo 18.14.al12 gnutls-debuginfo-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls 18.14.al12 gnutls-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls-devel 18.14.al12 gnutls-devel-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls-utils 18.14.al12 gnutls-utils-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls-guile 18.14.al12 gnutls-guile-2.8.5-18.14.al12.x86_64.rpm
i686 gnutls-debuginfo 18.14.al12 gnutls-debuginfo-2.8.5-18.14.al12.i686.rpm
i686 gnutls-devel 18.14.al12 gnutls-devel-2.8.5-18.14.al12.i686.rpm
i686 gnutls-guile 18.14.al12 gnutls-guile-2.8.5-18.14.al12.i686.rpm
i686 gnutls 18.14.al12 gnutls-2.8.5-18.14.al12.i686.rpm
i686 gnutls-utils 18.14.al12 gnutls-utils-2.8.5-18.14.al12.i686.rpm
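
ET.parse builds the entire document tree in memory, which may be a problem for a file as large as the 1.5G one mentioned in the question. A possible alternative is to stream the file with ET.iterparse and clear each update element once its packages have been read. The following is only a sketch under that assumption: the helper name iter_packages and the in-memory SAMPLE are made up for illustration and are not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET


def iter_packages(source):
    """Yield (arch, name, release, filename) for each <package> in source.

    source may be a file path or a binary file-like object.
    (Hypothetical helper, named here for illustration only.)
    """
    # Only "end" events are requested, so an element's attributes and
    # children are fully parsed by the time it is handed to us.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "update":
            continue
        for package in elem.iter("package"):
            filename = package.findtext("filename", default="")
            # Drop the leading "Packages/" directory, if present.
            if filename.startswith("Packages/"):
                filename = filename[len("Packages/"):]
            yield (package.get("arch"), package.get("name"),
                   package.get("release"), filename)
        # Release the contents of the finished <update>; only an emptied
        # stub element stays attached to the root.
        elem.clear()


# Tiny in-memory stand-in for updateinfo.xml, for demonstration only.
SAMPLE = b"""<?xml version="1.0" ?>
<updates>
<update>
<pkglist><collection short="amazon-linux">
<package arch="x86_64" epoch="0" name="libxml2" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
</collection></pkglist>
</update>
</updates>"""

for row in iter_packages(io.BytesIO(SAMPLE)):
    print(*row)
```

For the real file, passing the path instead (for example iter_packages('updateinfo.xml')) should work the same way, since iterparse accepts either a filename or a file object. Whether this is actually faster than ET.parse would need measuring; the main expected benefit is lower peak memory use.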