Braden Braden - 3 years ago 101
Python Question

Extracting data from XML into dictionary (1 line as key, next line as item)

I have a

XML
file with a header of summary information and a main data table after. I've got the main data table out and into a
pd.df
, but now I want to extract parts of the header information into a dictionary.

Example of
XML
file:


<Workbook>
<Worksheet>
<Tables>
<Row>
<Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Administrative Data</Data></Cell>
</Row>
<Row>
<Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">ID</Data></Cell>
<Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">B013</Data></Cell>
</Row>
<Row>
<Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Title</Data></Cell>
<Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Mr</Data></Cell>
</Row>
<Row>
<Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Last Name</Data></Cell>
<Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Data</Data></Cell>
</Row>
<Row>
<Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">First Name</Data></Cell>
<Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">Test</Data></Cell>
</Row>
<Row/>
<Row/>
<Row>
<Cell ss:StyleID="HeadTableTitle" ss:MergeAcross="1"><Data ss:Type="String">Biological and Medical Baseline Data</Data></Cell>
</Row>
<Row>
<Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Height</Data></Cell>
<Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">176 cm</Data></Cell>
</Row>
<Row>
<Cell ss:StyleID="HeadTableParameterName" ss:MergeAcross="1"><Data ss:Type="String">Weight</Data></Cell>
<Cell ss:StyleID="HeadTableParameterValue" ss:MergeAcross="7"><Data ss:Type="String">56.9 kg</Data></Cell>
</Row>
</Tables>
</Worksheet>
</Workbook>


What I want to be able to do is extract the data in the 'Administrative Data' section to a dictionary with the first 'cell' value as the key (if I can remove the spaces that would be awesome), and the second 'cell' value as the item. This then needs to be repeated so the data in the 'Biological and Medical Baseline Data' is held in a separate dictionary. Dictionary names can be whatever (e.g. 'subject' and 'biomed')

Current code to parse the
XML
file and access the tags:


from lxml import etree

f_path = 'data store/cortex_full.xml' # enter path of xml file

# open and parse xml file
with open(f_path, 'r', encoding='utf-8') as f: # set encoding to utf-8 for mac
root = etree.parse(f)

namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
'x': 'urn:schemas-microsoft-com:office:excel',
'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}


ws = root.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0:
tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
if len(tables) > 0:
rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
for row in rows:
cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)


Any suggestions on how to approach this? If a dictionary is not optional, also happy to work with other suggestions.

Answer Source

Make sure you declare the below well before

subject={}
bio={}
d=None  #If this doesn't work then use d={}

And consider replacing

for row in rows:
   cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)

with

        for row in rows:
            cells = row.xpath('./ss:Cell', namespaces=namespaces)
            if(len(cells)==2):
              key=None
              item=None
              for cell in cells:
                if(cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}StyleID']=="HeadTableParameterName"):
                  key=cell.xpath('./ss:Data', namespaces=namespaces)[0].text.strip()
                else:
                  item=cell.xpath('./ss:Data', namespaces=namespaces)[0].text.strip()
              if(not(key==None or item==None)):
                d[key]=item
            elif len(cells)==1:
              if(cells[0].attrib['{urn:schemas-microsoft-com:office:spreadsheet}StyleID']=='HeadTableTitle'):
                if(cells[0].xpath('./ss:Data', namespaces=namespaces)[0].text=='Biological and Medical Baseline Data'):
                  d=bio
                else:
                  d=subject
print(bio)
print(subject)

Although not necessary I've put in some checks just to give an idea, but you can extend the checks to make it more robust.

Also I've a working version here.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download