Boone - 5 months ago
Python Question

How to parse not well-formed XML from the web into an ElementTree

I am having difficulties working with an XML file from the web. It isn't my XML so I can't modify the format.

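import urllib.request  # a bare "import urllib" does not load the request submodule
import xml.etree.ElementTree as ET

xmls = urllib.request.urlopen('http://odds.smarkets.com/oddsfeed.xml')

# Parse the response body and wrap the root element in an ElementTree
tree = ET.ElementTree(ET.fromstring(xmls.read()))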


But it gives me a ParseError: not well-formed (invalid token): line 1, column 0

I thought it might have something to do with the encoding, but I don't know anything about encodings, and when I ran the file through Chared it reported utf_8.

I also tried using BeautifulSoup, but it seems to read only the first line:

<?xml version="1.0" encoding="utf-8"?>

Answer

Don't reinvent the wheel: use a library built for parsing XML feeds, feedparser:

from pprint import pprint

import feedparser

d = feedparser.parse('http://odds.smarkets.com/oddsfeed.xml')
pprint(d['feed'])

Prints:

{'contract': {'id': '16696354',
              'name': 'Arthur Burrell',
              'slug': 'arthur-burrell'},
 'event': {'date': '2016-07-01',
           'id': '741548',
           'name': '14:10',
           'parent': 'Newton Abbot',
           'parent_slug': 'newton-abbot',
           'slug': 'newton-abbot-2016-07-01T00:00:00-14-10',
           'time': '13:10:00',
           'type': 'horse racing race',
           'url': '/sport/horse-racing/newton-abbot/2016/07/01/14:10'},
 'market': {'id': '5464210', 'slug': 'to-place', 'winners': '3'},
 'odds': {'timestamp': '2016-07-01T 1:40:15'},
 'price': {'backers_stake': '2.50',
           'decimal': '1.35',
           'liability': '7.14',
           'percent': '74.07'}}