B_Furtado - 7 months ago
Python Question

How to systematically avoid or ignore child index out of range when reading large XML

I am reading a large XML file with more than 106,000 entries. Each entry is a group of researchers with a lot of information. I wrote a function to read it.

However, whenever any piece of information is missing, I get


IndexError: child index out of range


Is there a way to tell the program to ignore when the child is missing?

Because the data is so diverse, the amount of information available will probably differ from one researcher to the next.

Checking before every single access is probably not a good idea, for example:

if root[0][0][0][0]:
    tot_nac_2011 = int(root[0][0][0][0].attrib['TOT-BIBL-PERIODICO-NAC'])


Here is my code

from xml.etree import ElementTree
extended = ElementTree.parse('0000301510136952_2014_estendido.xml')

def read_researcher(extended):
    root = extended.getroot()
    members = []
    for each in range(len(root[0])):
        group_id = root.attrib['NRO-ID-GRUPO']
        research_id = root[0][each].attrib['NRO-ID-CNPQ']
        name = root[0][each].attrib['NOME-COMPLETO']
        tit = root[0][each].attrib['TITULACAO-MAXIMA']
        sex = root[0][each].attrib['SEXO']
        tot_nac_2011 = int(root[0][each][0][0].attrib['TOT-BIBL-PERIODICO-NAC'])
        tot_nac_2014 = int(root[0][each][0][3].attrib['TOT-BIBL-PERIODICO-NAC'])
        tot_int_2011 = int(root[0][each][0][0].attrib['TOT-BIBL-PERIODICO-INT'])
        tot_int_2014 = int(root[0][each][0][3].attrib['TOT-BIBL-PERIODICO-INT'])
        tot_bbl_2011 = int(root[0][each][0][0].attrib['TOT-BIBL'])
        tot_bbl_2014 = int(root[0][each][0][3].attrib['TOT-BIBL'])
        members.append(researchers.Researcher(group_id, research_id, name, tit, sex, tot_nac_2011, tot_nac_2014, tot_int_2011, tot_int_2014, tot_bbl_2011, tot_bbl_2014))
    return members

Answer

To answer your specific question: use exception handling with try/except around the code that extracts attribute values from child elements, and catch the specific errors that can occur there. This is the EAFP ("easier to ask forgiveness than permission") style of programming. The alternative is LBYL ("look before you leap"), where you test for each child before accessing it.
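To make the difference between the two styles concrete, here is a minimal sketch of both applied to a single attribute lookup. The `<group>`/`<member>` tag names and the function names `total_lbyl` and `total_eafp` are made up for illustration; only the `TOT-BIBL` attribute comes from the question.

```python
from xml.etree import ElementTree

# Hypothetical fragment: the real file's tag names are not shown in the question.
root = ElementTree.fromstring('<group><member TOT-BIBL="7"/></group>')


def total_lbyl(parent, index):
    # LBYL: verify the child and the attribute exist before touching them.
    if 0 <= index < len(parent) and 'TOT-BIBL' in parent[index].attrib:
        return int(parent[index].attrib['TOT-BIBL'])
    return None


def total_eafp(parent, index):
    # EAFP: attempt the access and handle the failure if it happens.
    try:
        return int(parent[index].attrib['TOT-BIBL'])
    except (IndexError, KeyError, ValueError):
        return None


print(total_lbyl(root, 0), total_eafp(root, 0))  # 7 7
print(total_lbyl(root, 3), total_eafp(root, 3))  # None None
```

With deeply nested lookups the LBYL version needs one check per level, which is why the EAFP version tends to stay shorter here.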

I would also improve the code by collecting the Researcher initialization arguments in an intermediate dictionary, and by moving the group_id lookup out of the loop, since it comes from the root element and is the same for every member.

The code:

from xml.etree import ElementTree

import researchers  # the asker's own module that defines Researcher


extended = ElementTree.parse('0000301510136952_2014_estendido.xml')


def get_value(item, index, value):
    try:
        return int(item[index].attrib[value])
    except (IndexError, KeyError, AttributeError, ValueError):
        # TODO: log
        return None


def read_researcher(extended):
    root = extended.getroot()
    group_id = root.attrib['NRO-ID-GRUPO']

    members = []
    for item in root[0]:
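        # Fall back to an empty element when item has no children, so that
        # get_value() returns None instead of this line raising IndexError.
        subitem = item[0] if len(item) else ElementTree.Element('empty')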
        researcher = {
            "group_id": group_id,
            "research_id": item.attrib.get('NRO-ID-CNPQ'),
            "name": item.attrib.get('NOME-COMPLETO'),
            "tit": item.attrib.get('TITULACAO-MAXIMA'),
            "sex": item.attrib.get('SEXO'),
            "tot_nac_2011": get_value(subitem, 0, 'TOT-BIBL-PERIODICO-NAC'),
            "tot_nac_2014": get_value(subitem, 3, 'TOT-BIBL-PERIODICO-NAC'),
            "tot_int_2011": get_value(subitem, 0, 'TOT-BIBL-PERIODICO-INT'),
            "tot_int_2014": get_value(subitem, 3, 'TOT-BIBL-PERIODICO-INT'),
            "tot_bbl_2011": get_value(subitem, 0, 'TOT-BIBL'),
            "tot_bbl_2014": get_value(subitem, 3, 'TOT-BIBL'),
        }

        members.append(researchers.Researcher(**researcher))
    return members
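
As a quick check of how the helper behaves, the sketch below runs `get_value` against a small in-memory fragment. The `<totals>` and `<year>` tag names and the sample values are invented for the example; the attribute names are the ones from the question.

```python
from xml.etree import ElementTree


def get_value(item, index, value):
    # Same helper as in the answer: any lookup or conversion failure yields None.
    try:
        return int(item[index].attrib[value])
    except (IndexError, KeyError, AttributeError, ValueError):
        return None


# Hypothetical fragment standing in for one researcher's yearly totals.
subitem = ElementTree.fromstring(
    '<totals>'
    '<year TOT-BIBL="12" TOT-BIBL-PERIODICO-NAC="5"/>'
    '<year TOT-BIBL="n/a"/>'
    '</totals>'
)

print(get_value(subitem, 0, 'TOT-BIBL'))                # 12
print(get_value(subitem, 3, 'TOT-BIBL'))                # None (no child at index 3)
print(get_value(subitem, 0, 'TOT-BIBL-PERIODICO-INT'))  # None (attribute missing)
print(get_value(subitem, 1, 'TOT-BIBL'))                # None (value is not an integer)
```

Note that missing data now shows up as None, so whatever consumes the Researcher objects has to be prepared for None in the numeric fields.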