B_Furtado - 1 year ago
Python Question

How to systematically avoid or ignore child index out of range when reading large XML

I am reading a large XML file with more than 106,000 entries. Each entry is a group of researchers with a lot of information. I wrote a function to read it.

However, whenever any piece of information is missing, I get

IndexError: child index out of range

Is there a way to tell the program to ignore when the child is missing?

Because the data is so varied, the amount of information will probably differ from one record to the next.

Checking every index by hand each time is probably not a good idea, for example:

    if root[0][0][0][0]:
        tot_nac_2011 = int(root[0][0][0][0].attrib['TOT-BIBL-PERIODICO-NAC'])

Here is my code

    from xml.etree import ElementTree
    extended = ElementTree.parse('0000301510136952_2014_estendido.xml')

    def read_researcher(extended):
        root = extended.getroot()
        members = []
        for each in range(len(root[0])):
            group_id = root.attrib['NRO-ID-GRUPO']
            research_id = root[0][each].attrib['NRO-ID-CNPQ']
            name = root[0][each].attrib['NOME-COMPLETO']
            tit = root[0][each].attrib['TITULACAO-MAXIMA']
            sex = root[0][each].attrib['SEXO']
            tot_nac_2011 = int(root[0][each][0][0].attrib['TOT-BIBL-PERIODICO-NAC'])
            tot_nac_2014 = int(root[0][each][0][3].attrib['TOT-BIBL-PERIODICO-NAC'])
            tot_int_2011 = int(root[0][each][0][0].attrib['TOT-BIBL-PERIODICO-INT'])
            tot_int_2014 = int(root[0][each][0][3].attrib['TOT-BIBL-PERIODICO-INT'])
            tot_bbl_2011 = int(root[0][each][0][0].attrib['TOT-BIBL'])
            tot_bbl_2014 = int(root[0][each][0][3].attrib['TOT-BIBL'])
            members.append(researchers.Researcher(group_id, research_id, name, tit, sex, tot_nac_2011, tot_nac_2014, tot_int_2011, tot_int_2014, tot_bbl_2011, tot_bbl_2014))
        return members


To answer your specific question: wrap the lookups in try/except and handle the errors that can occur when you extract attribute values from child elements. This is the EAFP ("easier to ask forgiveness than permission") programming style. The alternative is LBYL ("look before you leap"), where you check that each child exists before accessing it.
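To make the two styles concrete, here is a minimal sketch that reads the same attribute both ways. The tiny XML string, the tag names, and the helper names `total_eafp` and `total_lbyl` are made up for illustration; only the `TOT-BIBL` attribute name comes from the question.

```python
from xml.etree import ElementTree

# Hypothetical two-child element, used only for illustration.
node = ElementTree.fromstring('<r><c TOT-BIBL="5"/><c/></r>')


def total_eafp(parent, index, key):
    """EAFP: attempt the lookup and handle the failure afterwards."""
    try:
        return int(parent[index].attrib[key])
    except (IndexError, KeyError, ValueError):
        return None


def total_lbyl(parent, index, key):
    """LBYL: check every precondition before doing the lookup."""
    if index < len(parent) and key in parent[index].attrib:
        value = parent[index].attrib[key]
        if value.isdigit():
            return int(value)
    return None


print(total_eafp(node, 0, 'TOT-BIBL'))  # child and attribute are present
print(total_eafp(node, 1, 'TOT-BIBL'))  # attribute is missing
print(total_lbyl(node, 3, 'TOT-BIBL'))  # child is missing
```

With deeply nested lookups the LBYL version needs one check per level, which is why the try/except form tends to be shorter for this kind of data.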

I would also improve the code by collecting the Researcher initialization arguments in an intermediate dictionary, and by moving group_id out of the loop, since it comes from the root element and is the same for every member.
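The dictionary approach relies on `**` unpacking, which passes each key as a keyword argument. A small sketch of the idea follows, assuming the constructor's parameter names match the dictionary keys. The `Researcher` namedtuple here is a hypothetical stand-in for `researchers.Researcher`, reduced to three fields, and the values are placeholders.

```python
from collections import namedtuple

# Hypothetical stand-in for researchers.Researcher, reduced to three fields.
Researcher = namedtuple('Researcher', ['group_id', 'name', 'tot_bbl_2011'])

fields = {
    "group_id": "G1",        # placeholder value
    "name": "Example Name",  # placeholder value
    "tot_bbl_2011": None,    # a missing value stays visible as None
}

# ** unpacks the dictionary into keyword arguments:
# Researcher(group_id="G1", name="Example Name", tot_bbl_2011=None)
member = Researcher(**fields)
print(member)
```

If the real class uses different parameter names, the dictionary keys have to be renamed to match, otherwise the call raises a TypeError.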

The code:

from xml.etree import ElementTree

extended = ElementTree.parse('0000301510136952_2014_estendido.xml')

def get_value(item, index, value):
    # Return the attribute as an int, or None if the child or attribute is missing or malformed.
    try:
        return int(item[index].attrib[value])
    except (IndexError, KeyError, AttributeError, ValueError):
        # TODO: log
        return None

def read_researcher(extended):
    root = extended.getroot()
    group_id = root.attrib['NRO-ID-GRUPO']

    members = []
    for item in root[0]:
        subitem = item[0]
        researcher = {
            "group_id": group_id,
            "research_id": item.attrib.get('NRO-ID-CNPQ'),
            "name": item.attrib.get('NOME-COMPLETO'),
            "tit": item.attrib.get('TITULACAO-MAXIMA'),
            "sex": item.attrib.get('SEXO'),
            "tot_nac_2011": get_value(subitem, 0, 'TOT-BIBL-PERIODICO-NAC'),
            "tot_nac_2014": get_value(subitem, 3, 'TOT-BIBL-PERIODICO-NAC'),
            "tot_int_2011": get_value(subitem, 0, 'TOT-BIBL-PERIODICO-INT'),
            "tot_int_2014": get_value(subitem, 3, 'TOT-BIBL-PERIODICO-INT'),
            "tot_bbl_2011": get_value(subitem, 0, 'TOT-BIBL'),
            "tot_bbl_2014": get_value(subitem, 3, 'TOT-BIBL'),
        }
        # researchers is the asker's own module; the dict keys must match its parameter names
        members.append(researchers.Researcher(**researcher))

    return members
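To try the approach without the original file, here is a self-contained sketch on a small in-memory document. The tag names, the sample values, the `Researcher` namedtuple, and the `first_child` helper are assumptions made for the demo; only the attribute names come from the question. Note that `item[0]` in the loop above can itself raise IndexError when a researcher has no children at all, so this sketch guards that lookup too and lets `get_value` treat a missing parent as a missing value.

```python
from collections import namedtuple
from xml.etree import ElementTree

# Hypothetical stand-in for researchers.Researcher with a reduced field set.
Researcher = namedtuple(
    'Researcher',
    ['group_id', 'research_id', 'name', 'tot_bbl_2011', 'tot_bbl_2014'],
)

# Made-up sample: the second researcher lacks the fourth child,
# the third researcher has no children at all.
SAMPLE = """
<GRUPO NRO-ID-GRUPO="G1">
  <PESQUISADORES>
    <PESQUISADOR NRO-ID-CNPQ="1" NOME-COMPLETO="A">
      <INDICADORES>
        <ANO TOT-BIBL="2"/><ANO/><ANO/><ANO TOT-BIBL="7"/>
      </INDICADORES>
    </PESQUISADOR>
    <PESQUISADOR NRO-ID-CNPQ="2" NOME-COMPLETO="B">
      <INDICADORES>
        <ANO TOT-BIBL="4"/>
      </INDICADORES>
    </PESQUISADOR>
    <PESQUISADOR NRO-ID-CNPQ="3" NOME-COMPLETO="C"/>
  </PESQUISADORES>
</GRUPO>
"""


def get_value(item, index, key):
    """Return the attribute as an int, or None if anything is missing."""
    try:
        return int(item[index].attrib[key])
    except (IndexError, KeyError, AttributeError, ValueError, TypeError):
        # TypeError covers item being None (no parent element found).
        return None


def first_child(item):
    """Return item[0], or None when the element has no children."""
    try:
        return item[0]
    except IndexError:
        return None


def read_researchers(root):
    group_id = root.attrib.get('NRO-ID-GRUPO')
    members = []
    for item in root[0]:
        subitem = first_child(item)
        members.append(Researcher(
            group_id=group_id,
            research_id=item.attrib.get('NRO-ID-CNPQ'),
            name=item.attrib.get('NOME-COMPLETO'),
            tot_bbl_2011=get_value(subitem, 0, 'TOT-BIBL'),
            tot_bbl_2014=get_value(subitem, 3, 'TOT-BIBL'),
        ))
    return members


members = read_researchers(ElementTree.fromstring(SAMPLE))
for member in members:
    print(member)
```

Every researcher ends up in the list, and the fields that could not be read are None, so the gaps can be counted or logged afterwards instead of stopping the run.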