rahul gulati - 10 months ago
Python Question

Extract specific XML tag values in Python

I have an XML file which contains tags like these.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DataFlow id="ABC">
<Flow name="flow4" type="Ingest">
<Ingest dataSourceName="type1" tableName="table1">
<DataFlow id="MHH" dependsOn="ABC">
<Flow name="flow5" type="Reconcile">
<ReconcileColumns mode="required">
<Flow name="output" type="Export" format="Native">
<Table publishToSQLServer="true">

I want to process this XML in Python using the minimal DOM implementation (xml.dom.minidom).
I need to extract the information in the DataSet tag only when the Flow type is "Reconcile".

For example:

If the Flow type is "Reconcile", I need to go to the next Flow tag, named "output", and extract the values of the DataSetRef, DataSource and Date tags.

So far I have tried the code below, but I am getting blank values in all my fields.


from xml.dom.minidom import parse
import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("Store.xml")
collection = DOMTree.documentElement

#if collection.hasAttribute("DataFlows"):
#    print "Root element : %s" % collection.getAttribute("DataFlows")

pretty = DOMTree.toprettyxml()
print "Collectio: %s" % collection

dataflows = DOMTree.getElementsByTagName("DataFlow")

# Print detail of each movie.
for dataflow in dataflows:
    print "*****dataflow*****"
    if dataflow.hasAttribute("dependsOn"):
        print "Depends On is present"
    flows = DOMTree.getElementsByTagName("Flow")
    print "flows"
    for flow in flows:
        print "******flow******"
        if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile":
            flowByReconcileType = flow.getAttribute("type")
            TagValue = flow.getElementsByTagName("DataSet")
            print "Tag Value is %s" % TagValue
            print "flow type is: %s" % flowByReconcileType

From there on, I need to pass the 3 values extracted above to Unix shell scripts that process some directories.
Any help would be appreciated.

Answer

First of all, check that your XML is well-formed. You are missing a root tag, and some of your double quotes are the wrong kind (curly instead of straight), for example here: <Flow name=“flow4" type="Ingest">
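One way to run that check is to let the parser do it. The sketch below is only an illustration: the three XML snippets and the helper name is_well_formed are made up for this example and are not the asker's real file. It relies on minidom.parseString raising xml.parsers.expat.ExpatError for input that is not well-formed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        minidom.parseString(xml_text)
    except ExpatError as err:
        return False, str(err)
    return True, None


# Hypothetical snippets, not the asker's real file.
good = '<DataFlows><DataFlow id="ABC"><Flow name="flow4" type="Ingest"/></DataFlow></DataFlows>'
# \u201c is a curly opening quote, which is not a valid attribute delimiter.
bad_quotes = '<DataFlows><Flow name=\u201cflow4" type="Ingest"/></DataFlows>'
# Two top-level elements, i.e. no single root tag.
two_roots = '<DataFlow id="ABC"/><DataFlow id="MHH"/>'

for label, text in [("good", good), ("bad_quotes", bad_quotes), ("two_roots", two_roots)]:
    ok, error = is_well_formed(text)
    print(label, ok, error)
```

The error message includes the line and column where parsing stopped, which is usually enough to locate a stray quote or a missing root.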

In your code you are correctly grabbing the dataflows.

You don't need to query the DOMTree again for the flows. Querying the DOMTree returns every Flow in the whole document, whereas you can get just the flows of each dataflow by querying it directly, like this:

flows = dataflow.getElementsByTagName("Flow")
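To make the difference concrete, here is a small sketch in Python 3 syntax. The XML is a hypothetical, well-formed stand-in for the asker's file: the DataFlows root and the closing tags are assumptions, since the question does not show them.

```python
from xml.dom import minidom

# Hypothetical, well-formed stand-in for the asker's file.
SAMPLE = """<DataFlows>
  <DataFlow id="ABC">
    <Flow name="flow4" type="Ingest"/>
  </DataFlow>
  <DataFlow id="MHH" dependsOn="ABC">
    <Flow name="flow5" type="Reconcile"/>
    <Flow name="output" type="Export" format="Native"/>
  </DataFlow>
</DataFlows>"""

dom = minidom.parseString(SAMPLE)

# Document-wide query: every Flow in the file, regardless of its DataFlow.
all_flows = dom.getElementsByTagName("Flow")

# Scoped query: only the Flow descendants of each DataFlow.
flows_per_dataflow = {}
for dataflow in dom.getElementsByTagName("DataFlow"):
    flows = dataflow.getElementsByTagName("Flow")
    flows_per_dataflow[dataflow.getAttribute("id")] = [
        flow.getAttribute("name") for flow in flows
    ]

print(len(all_flows))
print(flows_per_dataflow)
```

With the document-wide query inside the loop, each dataflow would be paired with all three flows instead of only its own.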

Your condition if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile": looks OK to me. To get the next flow item you can do something like the following, always checking that the index is still inside the list:

for index, flow in enumerate(flows):
    if flow.hasAttribute("type") and flow.getAttribute("type") == "Reconcile":
        if index + 1 < len(flows):
            your_flow = flows[index + 1]
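Putting the pieces together, a minimal end-to-end sketch might look like the following (Python 3 syntax). It rests on a loud assumption: the question never shows where DataSetRef, DataSource and Date live, so the sample XML below places them as text-bearing child elements under the "output" Flow. The helper names element_text and values_after_reconcile are invented for this example. One plausible reason for the blank values in the question is that getElementsByTagName returns a list of element nodes, while the text itself sits in each element's child text nodes.

```python
from xml.dom import minidom

# Hypothetical layout: assumes DataSetRef, DataSource and Date are
# text-bearing elements somewhere under the "output" Flow.
SAMPLE = """<DataFlows>
  <DataFlow id="MHH" dependsOn="ABC">
    <Flow name="flow5" type="Reconcile">
      <ReconcileColumns mode="required"/>
    </Flow>
    <Flow name="output" type="Export" format="Native">
      <DataSet>
        <DataSetRef>ref1</DataSetRef>
        <DataSource>source1</DataSource>
        <Date>2016-01-01</Date>
      </DataSet>
    </Flow>
  </DataFlow>
</DataFlows>"""


def element_text(parent, tag_name):
    """Return the text of the first tag_name descendant, or None if absent."""
    nodes = parent.getElementsByTagName(tag_name)
    if not nodes:
        return None
    # Join only the direct text children of the first match.
    return "".join(
        child.data
        for child in nodes[0].childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def values_after_reconcile(dom):
    """Collect the three values from the Flow following each Reconcile Flow."""
    results = []
    for dataflow in dom.getElementsByTagName("DataFlow"):
        flows = dataflow.getElementsByTagName("Flow")
        for index, flow in enumerate(flows):
            # getAttribute returns "" when the attribute is missing,
            # so a separate hasAttribute check is not needed here.
            if flow.getAttribute("type") == "Reconcile" and index + 1 < len(flows):
                next_flow = flows[index + 1]
                results.append({
                    tag: element_text(next_flow, tag)
                    for tag in ("DataSetRef", "DataSource", "Date")
                })
    return results


values = values_after_reconcile(minidom.parseString(SAMPLE))
print(values)
```

Each dictionary in the result could then be handed to a shell script as separate arguments, for instance with subprocess.run(["./process.sh", ref, source, date]), where process.sh is a placeholder name for the asker's own script.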