hselbie - 6 months ago
Python Question

BS4 get XML tag variables

I am playing around with web scraping using bs4 and trying to get the title and color values from this line of XML:

<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">

The output would be a dict along the lines of
{'title': 'Approve', 'color': '#000000'}

The XML page is here.

I've already written this function, which is by no means efficient, but I would like the column titles of my DataFrame to come from the graph's title rather than a manually entered value. So rather than 'GID1' it would read 'Approve', or whatever the value of title is.

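import pandas as pd
from bs4 import BeautifulSoup

def rcp_poll_data(xml):
    # Parse the raw XML first; the "xml" parser needs lxml installed
    soup = BeautifulSoup(xml, "xml")
    dates = soup.find('series')
    datesval = dates.findChildren(string=True)
    del datesval[-7:]  # drop the last seven entries
    obama = soup.find('graph', {"gid": "1"})
    obamaval = obama.findChildren(string=True)
    romney = soup.find('graph', {"gid": "2"})
    romneyval = romney.findChildren(string=True)
    # Column names are still hard-coded here
    result = pd.DataFrame({'date': pd.to_datetime(datesval), 'GID1': obamaval, 'GID2': romneyval})
    return result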

I'm using bs4 and struggling to find the right terminology to get me there. Are the things I'm trying to isolate tags, elements, or attributes?

This isn't a professional thing; I'm just nurdling around for fun, so any help to get me slightly closer would be great. (I'm using Python 3.)


You just need to pull the attributes once you find the graph node:

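import requests
from bs4 import BeautifulSoup

# The "xml" parser requires lxml to be installed
url = "http://charts.realclearpolitics.com/charts/1044.xml"
soup = BeautifulSoup(requests.get(url).content, "xml")
# Find the graph node, then read its attributes like dict keys
g = soup.find("graph", gid="1")
data = {"title": g["title"], "color": g["color"]}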

Which will give you:

{'color': '#000000', 'title': 'Approve'}
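
On the terminology: graph is a tag (element), and title, color and gid are attributes of that tag. If the goal is to use each title as a column name instead of typing 'GID1' by hand, something like the rough sketch below might work. It runs on a made-up XML snippet rather than the real page, so the layout, the values and the "Disapprove" title are assumptions, and graph_columns is just a hypothetical helper name. It also uses "html.parser" so it runs without lxml; the same calls should behave the same way with the "xml" parser used above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the chart XML; the real file's layout and values may differ.
SAMPLE_XML = """
<chart>
  <series>
    <value xid="0">1/27/2009</value>
    <value xid="1">1/28/2009</value>
  </series>
  <graphs>
    <graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
      <value xid="0">47.0</value>
      <value xid="1">48.0</value>
    </graph>
    <graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
      <value xid="0">45.0</value>
      <value xid="1">44.0</value>
    </graph>
  </graphs>
</chart>
"""


def graph_columns(xml):
    """Hypothetical helper: map each graph's title attribute to its value strings."""
    soup = BeautifulSoup(xml, "html.parser")
    columns = {}
    for graph in soup.find_all("graph"):
        # graph["title"] reads one attribute; graph.attrs holds all of them as a dict
        values = [v.get_text(strip=True) for v in graph.find_all("value")]
        columns[graph["title"]] = values
    return columns


columns = graph_columns(SAMPLE_XML)
first = BeautifulSoup(SAMPLE_XML, "html.parser").find("graph", gid="1")

print(first.name)   # name of the tag (element)
print(first.attrs)  # all attributes of that tag as a dict
print(columns)      # values keyed by title instead of 'GID1' / 'GID2'
```

The columns dict can then be passed to pd.DataFrame (along with the dates) so the column names follow whatever the title attributes say.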