Richard Richard - 3 months ago 18
Python Question

pyquery (lxml) not finding a tag in a well-structured XML document?

I have an XML file that looks like this. The relevant bit is this:

<reference>
<citation>Vander Wal JS, Gang CH, Griffing GT, Gadde KM. Escitalopram for treatment of night eating syndrome: a 12-week, randomized, placebo-controlled trial. J Clin Psychopharmacol. 2012 Jun;32(3):341-5. doi: 10.1097/JCP.0b013e318254239b.</citation>
<PMID>22544016</PMID>
</reference>


I am trying to find the value of the PMID field, using PyQuery to parse the XML:

from pyquery import PyQuery as pq

text = open(f, 'r').read()
d = pq(text)
data = {}
data['nct_id'] = d('nct_id').text()

print d('reference')
reference = d('reference')
print reference('PMID')
data['pmid'] = reference('PMID').text()

print data['pmid']


Why isn't this working? In the console I see the full content of reference from the first print statement, followed by two empty values:

<reference>
<citation>Vander Wal JS, Gang CH, Griffing GT, Gadde KM. Escitalopram for treatment of night eating syndrome: a 12-week, randomized, placebo-controlled trial. J Clin Psychopharmacol. 2012 Jun;32(3):341-5. doi: 10.1097/JCP.0b013e318254239b.</citation>
<PMID>22544016</PMID>
</reference>


I can find other leaf nodes in the document (like nct_id) just fine using .find(), as the example code shows.

Is it that PyQuery doesn't like upper-case tags?

Answer

You can specify the parser to use and it will work:

d = pq(text, parser='xml')