in_learning_software - 2 months ago
Python Question

How to parse XML of nested tags in Python

I have the following XML.

<component name="QUESTIONS">
    <topic name="Chair">
        <state>active</state>
        <subtopic name="Wooden">
            <links>
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Wooden Chair</label>
                    <url>http://abcd.xyz.com/1111?view=app</url>
                </link>
                <link videoDuration="" youtubeId="" type="article">
                    <label>How To Assemble Wooden CHair</label>
                    <url>http://abcd.xyz.com/2222?view=app</url>
                </link>
                <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
                    <label>Wooden Chair Tutorial</label>
                    <url>/</url>
                </link>
                <link videoDuration="1:06" youtubeId="MSDVN235879" type="video">
                    <label>How To Access Wood</label>
                    <url>/</url>
                </link>
            </links>
        </subtopic>
    </topic>
    <topic name="Table">
        <state>active</state>
        <subtopic name="">
            <links>
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Tables</label>
                    <url>http://abcd.xyz.com/3333?view=app</url>
                </link>
                <link videoDuration="" youtubeId="" type="article">
                    <label>Set-up Table</label>
                    <url>http://abcd.xyz.com/4444?view=app</url>
                </link>
                <link videoDuration="" youtubeId="" type="article">
                    <label>How To Change table</label>
                    <url>http://abcd.xyz.com/5555?view=app</url>
                </link>
            </links>
        </subtopic>
    </topic>
</component>


I am trying to parse this XML in Python and build a URL array which will contain:
1. All the http URLs present in the XML
2. For each link tag, if a youtubeId is present, the YouTube URL built from it

I have the following code, but it does not give me the URLs and links.

from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.iter():
    print node.tag, node.attrib.get('url')

for node in tree.iter('outline'):
    name = node.attrib.get('link')
    url = node.attrib.get('url')
    if name and url:
        print ' %s :: %s' % (name, url)
    else:
        print name


How can I get all the URLs?

Update: I developed the following code based on the answers below.
The problem is that it prints just one URL, not all of them.

from xml.etree import ElementTree

def fetch_faq_urls():
    url_list = []
    with open('faq.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    for link in tree.iter('link'):
        youtube = link.get('youtubeId')
        if youtube:
            print "https://www.youtube.com/watch?v=" + youtube
            video_url = "https://www.youtube.com/watch?v=" + youtube
            url_list.append(video_url)
            # print "youtubeId", link.find('label').text, '???'
        else:
            print link.find('url').text
            article_url = link.find('url').text
            url_list.append(article_url)
            # print 'url', link.find('label').text,
        return url_list

faqs = fetch_faq_urls()
print faqs

Answer

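The information you want is under the <link> elements, so just iterate over those. Use get() to read the youtubeId attribute and find() to get the child <url> element. Note that <url> is a child element, not an attribute, which is why attrib.get('url') in the first attempt returns None, and there is no 'outline' tag in this XML for tree.iter('outline') to match.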

from xml.etree import ElementTree

with open('faq.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for link in tree.iter('link'):
    youtube = link.get('youtubeId')
    if youtube:
        print "youtube", link.find('label').text, '???'
    else:
        print 'url', link.find('label').text, link.find('url').text
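
To collect the URLs into a list rather than print them, a minimal sketch along the same lines could look like the following. It is written for Python 3 (print as a function), unlike the Python 2 snippets above. The names collect_urls, YOUTUBE_WATCH_PREFIX and SAMPLE are made up for illustration, the YouTube prefix is the one used in the question's updated code, and skipping non-http <url> values such as "/" is an assumption about what requirement 1 means. A plausible reason the updated function in the question produces only one URL is a return statement indented inside the for loop, which exits on the first <link>; here the return sits after the loop.

```python
from xml.etree import ElementTree

# Assumption: same watch-URL prefix the question's updated code uses.
YOUTUBE_WATCH_PREFIX = "https://www.youtube.com/watch?v="


def collect_urls(root):
    """Return one URL per <link>: a YouTube URL if youtubeId is set,
    otherwise the text of the child <url> when it starts with 'http'."""
    urls = []
    for link in root.iter('link'):
        youtube_id = link.get('youtubeId')  # attribute lookup; '' if empty
        if youtube_id:
            urls.append(YOUTUBE_WATCH_PREFIX + youtube_id)
        else:
            url = link.findtext('url')  # text of the child <url>, or None
            if url and url.startswith('http'):
                urls.append(url)
    return urls  # after the loop, so every <link> is visited


# Shortened copy of the question's XML, inlined so the sketch runs on its own.
SAMPLE = """<component name="QUESTIONS">
    <topic name="Chair">
        <state>active</state>
        <subtopic name="Wooden">
            <links>
                <link videoDuration="" youtubeId="" type="article">
                    <label>Understanding Wooden Chair</label>
                    <url>http://abcd.xyz.com/1111?view=app</url>
                </link>
                <link videoDuration="11:35" youtubeId="Qasefrt09_2" type="video">
                    <label>Wooden Chair Tutorial</label>
                    <url>/</url>
                </link>
            </links>
        </subtopic>
    </topic>
</component>"""

urls = collect_urls(ElementTree.fromstring(SAMPLE))
for url in urls:
    print(url)
```

To run it against the file from the question, ElementTree.parse('faq.xml').getroot() can be passed in place of the parsed SAMPLE string.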