Tyler Marques Tyler Marques - 1 month ago 7
Python Question

Creating dynamic variables for XML Parsing

I'm incredibly new at this, and I've tried searching but nothing I've found has been able to work for me.

I have xml data that looks like this

<datainfo>
<data>
<info State="1" Reason="x" Start="01/01/2016 00:00:00.000" End="01/01/2016 02:00:00.000"></info>
<info State="1" Reason="y" Start="01/01/2016 02:00:00.000" End="01/01/2016 02:01:00.000">
<moreinfo Start="01/01/2016 02:00:00.000" End="01/01/2016 02:00:30.000"/>
<moreinfo Start="01/01/2016 02:00:30.000" End="01/01/2016 02:01:00.000"/>
</info>
<info State="2" Start="01/01/2016 02:01:00.000" End="01/01/2016 02:10:00.000"></info>
...
</data>
</datainfo>


I want find how much time was spent in State {1,2,...} for reason {x,y,...} on a specific day and have that print to a .csv format to be latter read in excel.

The issue I'm having is I can't use static variables because there are hundreds of different states for hundreds of different reasons, and they change constantly.

If I'm not clear please tell me, I am brand new to this and really appreciate any and all help.

Edit: Here is what I currently have, hopefully this will clear up what I'm trying to do.

from datetime import datetime
from lxml import etree as ET

def parseXML(file):
handler = open(file, "r")
tree = ET.parse(handler)
info_list = tree.xpath('//info')
root = tree.getroot()
dictionary = {}
info_len = len(info_list)

for i in range(info_len):
info=root[0][0][i]
info_attribs = info.attrib
end = info_attribs[u'End']
start = info_attribs[u'Start']
FMT = '%m/%d/%Y %H:%M:%S.%f'
tdelta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
t_dif = (tdelta.total_seconds()) / 60
try:
dictionary[info_attribs[u'State'] + status_attribs[u'Reason']] = t_dif
except:
continue


I'm trying to iterate through each line, find the State and the reason, then add them to a dictionary. If the entry already exists for that state and reason, I want to add it to the current value.

Let me know if I should provide more info!

Edit #2:

The output I'm looking for would be in the form of a .csv, stuctured like this:

State - Reason, [Total time spent in State 1 for x reason]

Answer

You can use a defaultdict for recurring keys using lists as value, you can also filter the info nodes using an xpath to only find the nodes that have both of the attributes you want so no need for any except:

x = """<datainfo>
   <data>
       <info State="1" Reason="x" Start="01/01/2016 00:00:00.000" End="01/01/2016 02:00:00.000"></info>
       <info State="1" Reason="y" Start="01/01/2016 02:00:00.000" End="01/01/2016 02:01:00.000">
            <moreinfo Start="01/01/2016 02:00:00.000" End="01/01/2016 02:00:30.000"/>
            <moreinfo Start="01/01/2016 02:00:30.000" End="01/01/2016 02:01:00.000"/>
       </info>
       <info State="2" Start="01/01/2016 02:01:00.000" End="01/01/2016 02:10:00.000"></info>
   </data>
</datainfo>"""

from collections import defaultdict
import lxml.etree as et
from datetime import datetime

FMT = '%m/%d/%Y %H:%M:%S.%f'
tree = et.fromstring(x)
d = defaultdict(list)

for node in tree.xpath("//data/info[@Reason and @State]"):
    state = node.attrib["State"]
    reason = node.attrib["Reason"]
    end = node.attrib["End"]
    start = node.attrib[u'Start']
    tdelta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    d[state, reason].append((tdelta.total_seconds()) / 60))

print(d)

Depending on how you want the data to look for recurring keys would determine how you wrote to the csv, if you wanted one row each:

import csv
with open("out.csv", "w") as f:
    wr = csv.writer(f)
    for k,v in d.items():
        for val in v:
            wr.writerow([k] + val)

If you actually want to sum:

d = defaultdict(float)

for node in tree.xpath("//data/info[@Reason and @State]"):
    state = node.attrib["State"]
    reason = node.attrib["Reason"]
    end = node.attrib["End"]
    start = node.attrib[u'Start']
    tdelta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    d[state, reason] += (tdelta.total_seconds()) / 60

Then:

import csv
with open("out.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerows(d.items())