Daniel - 2 months ago
Python Question

Python BS4 with SDMX

I would like to retrieve data from an SDMX file (such as https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried BeautifulSoup, but it does not seem to see the tags. Here is the code:

import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")


This gives me an empty result.

Is BS4 the wrong tool, or (more likely) what am I doing wrong?
Thanks in advance

Answer

soup.findAll("bbk:series"), with the tag name in lowercase, returns the result.

Even though you pass lxml as the parser, BeautifulSoup still parses the document as HTML. HTML tag names are case-insensitive, so the parser lowercases all of them, which is why soup.findAll("bbk:series") works and soup.findAll("bbk:Series") does not. See "Other parser problems" in the official documentation.
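
A small sketch of the lowercasing effect, run against an inline sample instead of the live download. The sample markup is a hypothetical, heavily simplified stand-in for the Bundesbank file; the namespace URI, the bbk:Obs element and the attribute names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the SDMX payload (not the real file layout).
SAMPLE = """<message xmlns:bbk="http://example.org/bbk">
  <bbk:Series FREQ="M">
    <bbk:Obs TIME_PERIOD="2016-01" OBS_VALUE="1.5"/>
  </bbk:Series>
</message>"""

# 'lxml' selects lxml's HTML parser, which lowercases tag names.
soup = BeautifulSoup(SAMPLE, "lxml")

mixed_case = soup.find_all("bbk:Series")  # name as written in the file
lower_case = soup.find_all("bbk:series")  # name as stored after parsing

print("bbk:Series matches:", len(mixed_case))
print("bbk:series matches:", len(lower_case))
```

With the HTML parser, only the lowercase lookup should find the element, which matches the empty result in the question.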

If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, since lxml is the only XML parser BeautifulSoup supports. You can then use ts_series = soup.findAll("Series") to get the result, because BeautifulSoup strips the namespace prefix bbk from the tag name.
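
As a rough sketch of the XML route, the snippet below parses an inline sample with the 'xml' parser and collects the observations. The sample is again a hypothetical simplification: the namespace URI, the bbk:Obs element and the TIME_PERIOD / OBS_VALUE attribute names are assumptions, so adjust them to whatever the real file contains.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the SDMX payload (not the real file layout).
SAMPLE = """<message xmlns:bbk="http://example.org/bbk">
  <bbk:Series FREQ="M">
    <bbk:Obs TIME_PERIOD="2016-01" OBS_VALUE="1.5"/>
    <bbk:Obs TIME_PERIOD="2016-02" OBS_VALUE="1.7"/>
  </bbk:Series>
</message>"""

# 'xml' selects lxml's XML parser: case is preserved and the
# namespace prefix is kept separate from the tag name.
soup = BeautifulSoup(SAMPLE, "xml")

observations = []
for series in soup.find_all("Series"):
    for obs in series.find_all("Obs"):
        # Attribute names keep their original case in XML mode.
        observations.append((obs["TIME_PERIOD"], float(obs["OBS_VALUE"])))

print(observations)
```

To use it on the real download, pass the fetched bytes (html_source in the question) in place of SAMPLE.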