mariz mariz - 4 months ago 16
Java Question

How to generate an XML file from the first record of a large (10 GB) XML file without getting a memory error?

I have a large XML file (10 GB) and I want to create a new XML file that contains only its first record. I tried to do this in Java and in Python, but I got a memory error because I was loading the entire file into memory.

In another post, someone suggested that XSLT is the best solution for this. I'm new to XSLT and don't know how to do this with it, so please suggest a stylesheet that does this.

Sample of the large XML file (10 GB):

<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
<MembershipInfoListItem>
<MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
<ParticipationStatus>1</ParticipationStatus>
<DataSharing>1</DataSharing>
<MasterInfo>
<Gender>1</Gender>
<Salutation>1</Salutation>
<FirstName>Hazel</FirstName>
<LastName>Sweetman</LastName>
<DateOfBirth>1957-03-25</DateOfBirth>
</MasterInfo>
</MembershipInfoListItem>
<Header>
<BusinessPartner>CHILIS_US</BusinessPartner>
<FileType>mde</FileType>
<FileNumber>17</FileNumber>
<FormatVariant>1</FormatVariant>
<NumberOfRecords>22</NumberOfRecords>
<CreationDate>2016-06-07T12:00:46-07:00</CreationDate>
</Header>
<MembershipInfoListItem>
<MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
<ParticipationStatus>1</ParticipationStatus>
<DataSharing>1</DataSharing>
<MasterInfo>
<Gender>1</Gender>
<Salutation>1</Salutation>
<FirstName>Hazel</FirstName>
<LastName>Sweetman</LastName>
<DateOfBirth>1957-03-25</DateOfBirth>
</MasterInfo>
</MembershipInfoListItem>
.....
.....
</MemberDataExport>


I want to create a file like this:

<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
<MembershipInfoListItem>
<MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
<ParticipationStatus>1</ParticipationStatus>
<DataSharing>1</DataSharing>
<MasterInfo>
<Gender>1</Gender>
<Salutation>1</Salutation>
<FirstName>Hazel</FirstName>
<LastName>Sweetman</LastName>
<DateOfBirth>1957-03-25</DateOfBirth>
</MasterInfo>
</MembershipInfoListItem>
</MemberDataExport>


Is there any other way to do this without getting a memory error? Please suggest that too.

Answer

In Python (which you mentioned besides Java) you could use ElementTree.iterparse, which reads the file incrementally instead of loading it as a whole, and break out of the parsing loop as soon as you have found the element(s) you want to copy:

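import xml.etree.ElementTree as ET

copy = 1    # number of second-level elements (children of the root) to copy, at least 1
count = 0   # children copied so far
level = -1  # current depth; the root element is at level 0

for event, elem in ET.iterparse('input1.xml', events=('start', 'end')):
    if event == 'start':
        level += 1
        if level == 0:
            # Start tag of the root: create an empty root with the same qualified tag
            result = ET.ElementTree(ET.Element(elem.tag))
    elif event == 'end':
        level -= 1
        if level == 0:
            # A child of the root has been read completely
            result.getroot().append(elem)
            count += 1
            if count >= copy:
                break  # stop here; the rest of the file is never read

# Write with an XML declaration and the input's default namespace
result.write('result1.xml', encoding='UTF-8', xml_declaration=True,
             default_namespace='http://www.payback.net/lmsglobal/batch/memberdataexport')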
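
To try the approach above without the 10 GB file, here is a minimal sketch that wraps the same loop in a function and runs it on a tiny in-memory document. The function name copy_first_children and the sample data are made up for illustration; only the namespace URI comes from the question.

```python
import io
import xml.etree.ElementTree as ET

NS = 'http://www.payback.net/lmsglobal/batch/memberdataexport'

def copy_first_children(source, n=1):
    """Return an ElementTree with the root and its first n children (n >= 1)."""
    result = None
    level = -1   # the root element is at level 0
    copied = 0
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            level += 1
            if level == 0:
                # Empty root carrying the same namespace-qualified tag
                result = ET.ElementTree(ET.Element(elem.tag))
        else:
            level -= 1
            if level == 0:
                # A direct child of the root is complete: copy it
                result.getroot().append(elem)
                copied += 1
                if copied >= n:
                    break  # stop parsing; later records are never read
    return result

# Hypothetical stand-in for the large file: two records, identifiers A and B
sample = (
    '<MemberDataExport xmlns="%s">'
    '<MembershipInfoListItem><MembershipIdentifier>A</MembershipIdentifier></MembershipInfoListItem>'
    '<MembershipInfoListItem><MembershipIdentifier>B</MembershipIdentifier></MembershipInfoListItem>'
    '</MemberDataExport>' % NS
).encode('utf-8')

tree = copy_first_children(io.BytesIO(sample), n=1)

# Serialize to memory instead of a file so the result can be inspected
out = io.BytesIO()
tree.write(out, encoding='UTF-8', xml_declaration=True, default_namespace=NS)
first_record_xml = out.getvalue().decode('utf-8')
```

For the real file you would pass the file name instead of the BytesIO object and write to a file name as well; memory use should then depend on the size of the copied records, not on the size of the input.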

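As for better preservation of the namespace prefixes, I have had some success with also listening for the start-ns event and registering each collected namespace through ET.register_namespace (note that this registry is global to the module, not specific to one ElementTree):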

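import xml.etree.ElementTree as ET

copy = 1    # number of second-level elements (children of the root) to copy, at least 1
count = 0   # children copied so far
level = -1  # current depth; the root element is at level 0

for event, elem in ET.iterparse('input1.xml', events=('start', 'end', 'start-ns')):
    if event == 'start-ns':
        # For this event, elem is a (prefix, uri) tuple rather than an element
        ET.register_namespace(elem[0], elem[1])
    elif event == 'start':
        level += 1
        if level == 0:
            # Start tag of the root: create an empty root with the same qualified tag
            result = ET.ElementTree(ET.Element(elem.tag))
    elif event == 'end':
        level -= 1
        if level == 0:
            # A child of the root has been read completely
            result.getroot().append(elem)
            count += 1
            if count >= copy:
                break  # stop here; the rest of the file is never read

# The registered prefixes are used on output, so no default_namespace is passed
result.write('result1.xml', encoding='UTF-8', xml_declaration=True)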
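
To see what the start-ns event delivers, the following small sketch collects the declarations of a shortened version of the sample root element into a dictionary. The variable names are arbitrary and the document is a cut-down stand-in for the real file.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the input: same namespace declarations, one empty record
sample = (
    b'<MemberDataExport'
    b' xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport"'
    b' xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">'
    b'<MembershipInfoListItem/>'
    b'</MemberDataExport>'
)

namespaces = {}
# With only 'start-ns' requested, each item carries a (prefix, uri) tuple
for event, payload in ET.iterparse(io.BytesIO(sample), events=('start-ns',)):
    prefix, uri = payload
    namespaces[prefix] = uri  # the default namespace has the empty prefix ''
```

The default namespace shows up under the empty prefix, which is the value that gets passed on to ET.register_namespace in the loop above.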