user3768258 - 3 months ago
Python Question

Get text between two closed tags XML - Python

I downloaded my Foursquare data, which comes in KML format. I'm parsing it as an XML file with Python and cannot figure out how to get the text between the closing </a> tag and the closing </description> tag. (It's the text I typed when I checked in; in the example below it's "FINALLY HERE!! With Sonya and co", preceded by a hyphen.)

This is an example of what the data looks like.

<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
<updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
<published>Tue, 24 Jan 12 17:14:00 +0000</published>
<visibility>1</visibility>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-75.20104383595685,39.9528387056977</coordinates>
</Point>
</Placemark>


So far I've been able to get the lat/longs, published dates, names, and links, each with code along these lines:

latitudes = []
longitudes = []

for d in dom.getElementsByTagName('coordinates'):
    # Break them up into latitude and longitude
    coords = d.firstChild.data.split(',')
    longitudes.append(float(coords[0]))
    latitudes.append(float(coords[1]))


I tried this (the data begins with the header shown below it, which I haven't figured out how to handle yet):

for d in dom.getElementsByTagName('description'):
    description.append(d.firstChild.data.encode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>:


I then tried accessing it with d.firstChild.nextSibling.firstChild.data.encode('utf-8'), but that just gives me "hummus grill", which I assume is the text between the a tags (rather than the text from the name tag).

Answer

The following works for me:

In [44]: description = []

In [45]: for d in dom.getElementsByTagName('description'):
   ....:     description.append(d.firstChild.nextSibling.nextSibling.data.encode('utf-8'))
   ....:     

In [46]: description
Out[46]: ['- FINALLY HERE!! With Sonya and co']
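
To see why two nextSibling hops are needed, it can help to list the direct children of the description node. This is only a minimal sketch, assuming a trimmed copy of the Placemark from the question; the names `sample`, `children`, and `trailing` are made up for illustration.

```python
from xml.dom.minidom import parseString

# Trimmed copy of the sample Placemark from the question (assumption:
# only the <description> element matters for this illustration).
sample = ('<Placemark><description>@<a href="https://foursquare.com/v/'
          'hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>'
          '- FINALLY HERE!! With Sonya and co</description></Placemark>')

dom = parseString(sample)
d = dom.getElementsByTagName('description')[0]

# Record what kind of node each direct child of <description> is.
children = []
for n in d.childNodes:
    if n.nodeType == n.TEXT_NODE:
        children.append(('text', n.data))
    else:
        children.append(('element', n.tagName))

print(children)

# firstChild is the '@' text node, its nextSibling is the <a> element,
# and the sibling after that is the text following the closing </a> tag.
trailing = d.firstChild.nextSibling.nextSibling.data
print(trailing)
```

The question's attempt, d.firstChild.nextSibling.firstChild, stops at the a element and descends into it, which is why it returned "hummus grill".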

Or, if you want the entire text in the description tag:

from xml.dom.minidom import parse, parseString

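def getText(node, recursive = False):
    """ 
    Get all the text associated with this node.
    With recursive == True, all text from child nodes is retrieved.
    With recursive == False, None is returned if a non-text child is found.
    """
    L = ['']
    for n in node.childNodes:
        # Use the constants on the node itself rather than the global dom
        if n.nodeType in (n.TEXT_NODE, n.CDATA_SECTION_NODE):
            L.append(n.data)
        else:
            if not recursive:
                return None
            # Descend into the child element, keeping the recursive flag
            L.append(getText(n, recursive))
    return ''.join(L)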

dom = parseString("""<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>""")

description = []

for d in dom.getElementsByTagName('description'):
    description.append(getText(d, recursive = True))

print(description)

This will print: [u'@hummus grill- FINALLY HERE!! With Sonya and co'] (on Python 3 the u prefix is omitted).
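
As an alternative to minidom (a different technique from the one above, not part of the original answer), the standard library's xml.etree.ElementTree exposes the text after a closing tag as the element's `.tail`. The sketch below is hedged: it assumes the header from the question is closed with `</Folder></kml>` (the question only shows its beginning), that the file declares no XML namespace, and the names `sample`, `tails`, `comments`, and `full_texts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Header from the question plus one Placemark. The closing </Folder></kml>
# tags are an assumption, since the question only shows the start of the file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<kml><Folder><name>foursquare checkin history </name>'
    '<description>foursquare checkin history </description>'
    '<Placemark><name>hummus grill</name>'
    '<description>@<a href="https://foursquare.com/v/hummus-grill/'
    '4aab4f71f964a520625920e3">hummus grill</a>'
    '- FINALLY HERE!! With Sonya and co</description>'
    '</Placemark></Folder></kml>'
)

root = ET.fromstring(sample)

tails = []       # text between </a> and </description>
comments = []    # same text with the leading "- " stripped
full_texts = []  # all text inside each <description>

for desc in root.iter('description'):
    full_texts.append(''.join(desc.itertext()))
    link = desc.find('a')
    # The Folder-level description has no <a> child, so it is skipped here.
    if link is not None and link.tail:
        tails.append(link.tail)
        comments.append(link.tail.lstrip('- '))

print(tails)
print(comments)
print(full_texts)
```

Checking for a missing a element also sidesteps the header's own description, which contains only plain text.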
