Carver Stone Carver Stone - 3 months ago 15
Python Question

python merge multiple xmls to single CSV

Here's a snippet from a script I'm trying to put together to make my life easier. I have a bunch of XML files from different API sources. They have different items in them and different amount of fields. What they do all have in common is a common field like "clientid".

What I want to do is end up with a CSV that has the combined headers of all the XMLs with their corresponding data. So i need to be able to make sure that all the info from the "clientid" of 12345 adds onto the end of the row for the client id of the same name in the "itemid" pull

item data:

<item>
<id>99899</id>
<client-id>12345</client-id>


part of script:

def parseXML():
### Parse XML and convert to CSV ###
#Get XML Source #
tree = ET.fromstring(getdata)
# open a file for writing
xmlTest01 = open('xmlTest01.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(xmlTest01)
item_head = []

count = 0
for member in tree.findall('item'):
item = []
if count == 0:
id = member.find('id').tag
item_head.append(id)
clientid = member.find('client-id').tag
item_head.append(clientid)

id = member.find('id').text
item.append(id)
clientid = member.find('client-id').text
item.append(clientid)
csvwriter.writerow(item)
xmlTest01.close()


The next set of data has this in it:

<client>
<id>12345</id>
<name>Clients name</name>
<current type="boolean">true</current>
<status>good</status>


So I want to check the row in the previous set of data for the clientid of the same and then add name, current and status to the end of that row.

Any ideas on the best way to do this? I have about 5-7 of these types of files to merge. Should I be trying to combine the files first before converting them to CSV? This might be ok if they all had similar content but they dont.

Desired output which combines values of both xml files:

id,clientid,name,current,status
99899,12345,Clients name,true,good


Update Aug 18

Here is what I've been trying. You can tell by the comments what isn't working.

def parseXML():
### Parse XML and convert to CSV ###
#Get XML Source #
projecttree = ET.fromstring(projectCall)
clienttree = ET.fromstring(clientCall)
projectdettree = ET.fromstring(projDetJoined)
# print (projectdettree)
# print (clienttree)
# print (projecttree)
# exit()
#used if filename tree = ET.parse('filename.xml')
#used if filename root = tree.getroot()
# open a file for writing
with open('Output.csv', 'w') as f:
csvwriter = csv.writer(f, lineterminator = '\n')
# COLUMN HEADERS
csvwriter.writerow(['Pid','Clientid','ClientName','ClientActive','ProjectName','ProjectActive','Billable','BillBy','HourlyRate','Budget',
'OverbudgetNotificationPercentage','CreatedAt','UpdatedAt','StartsOn','EndsOn','Estimate','EstimateBy','Notes','CostBudget','CostRate'])

# for k in projectdettree.iter('team_members/item'):
for i in projecttree.iter('project'):
for j in clienttree.iter('client'):
# if k.find('cid').text == i.find('client-id').text == j.find('id').text:
if i.find('client-id').text == j.find('id').text:
# DATA ROWS
csvwriter.writerow([i.find('id').text, j.find('id').text, j.find("name").text,j.find("active").text,
i.find('name').text, i.find('active').text, i.find('billable').text,
i.find('bill-by').text, i.find('hourly-rate').text, i.find('budget').text,
i.find('over-budget-notification-percentage').text, i.find('created-at').text,
i.find('updated-at').text, i.find('starts-on').text, i.find('ends-on').text,
i.find('estimate').text, i.find('estimate-by').text, i.find('notes').text, i.find('cost-budget').text,
k.find("cost_rate").text])


What I'd like to have in the outcome is:

Pid, Clientid, ClientName, ClientActive, ProjectName, ProjectActive, Billable, BillBy, HourlyRate, Budget, OverbudgetNotificationPercentage, CreatedAt, UpdatedAt, StartsOn, EndsOn, Estimate, EstimateBy, Notes, CostBudget, TeammemberName1, CostRate1, TeammemberName2, CostRate2, TeammemberName3, CostRate3


See CostRate and TeammemberName for my above test. There's only one entry per team_member/item but there might be 1 item or there might 10 items. I'd like to have it add an additional column per.

thsoe fields tags in the xml are

<full_name>Bob R</full_name>
<cost_rate>76.0</cost_rate>


Additionally if that works I want to add other fields from the big sheet, so under tasks/item for example there are these fields:

<task_id>66632</task_id>
<billed_rate>25.0</billd_rate>


I'd like to be able to have THOSE also show up in that order so:

Pid, Clientid, ClientName, ClientActive, ProjectName, ProjectActive, Billable, BillBy, HourlyRate, Budget, OverbudgetNotificationPercentage, CreatedAt, UpdatedAt, StartsOn, EndsOn, Estimate, EstimateBy, Notes, CostBudget, TeammemberName1, CostRate1, TeammemberName2, CostRate2, TeammemberName3, CostRate3, TaskId1, TotalHours1, TaskId2, TotalHours2


I can add the rest of the fields I need using the same pattern for those.

Here's an example of the initial projects list ( can provide the full file but its just hundreds of the same as the example)

https://www.dropbox.com/s/suxrt6jgeeanpol/singlefirstProjectsList.xml?dl=0

Here's an example of the clients list, again this is big if its the full out put. All I need form this file is the NAME to match the client-id in the project list file (unfortunately the project list doesnt just include the client name. Maybe because multiple clients can be under same project ID:

https://www.dropbox.com/s/x3khayr9s3g8pur/twofirstClientfromFullList.xml?dl=0

Here's what the big XML looks like after your help previously that has the TASKS and TEAMMEMBERs. I'll need multiple fields from this eventually but just a couple to get me started would do:

https://www.dropbox.com/s/x6323eo077xj7hn/singlePIDexampleBig.xml?dl=0

Here is my full script so far (changed some of the personal info so the actual calls and passwords wont work):

https://www.dropbox.com/s/x27nse8qesjuiv6/08xmlTestExample.py?dl=0

Answer

Consider iterating across both files and conditionally check for client ids:

import csv
import xml.etree.ElementTree as ET 

def parseXML():
    itemtree = ET.fromstring(itemdata)
    clienttree = ET.fromstring(cllientdata)

    with open('Output.csv', 'w') as f:             
        csvwriter = csv.writer(f, lineterminator = '\n')
        # COLUMN HEADERS
        csvwriter.writerow(['id','clientid','name','current','status'])

        for i in itemtree.iter('item'):
            for j in clienttree.iter('client'):
                if i.find('client-id').text == j.find('id').text:
                    # DATA ROWS
                    csvwriter.writerow([i.find('id').text, j.find('id').text,
                                        j.find('name').text, j.find('current').text,
                                        j.find('status').text])

if __name__ == "__main__":
    parseXML()