Atma Atma -4 years ago 89
Python Question

Beautiful Soup returns close tag instead of tag text

I have the following RSS feed (SoundCloud):

<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>

I attempt to get the link tag contents with the following:

soup = BeautifulSoup(response, "lxml")

items = soup.findAll("item")
for i in items:
    print(i)
    created_at = i.find('pubdate')
    created_at = created_at.contents[0][:16]

    url = i.find('link')

This prints:

<link/>

If I try
url = i.find('link').string
url = i.find('link').content

I get


When I print the "i" item it prints a close tag first for link:
Daptone Records
Sharon Jones & the Dap-Kings' first ever holiday album is out now!

How can I get the link to open properly?

Answer

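Parse the feed with BeautifulSoup's 'xml' parser (it requires lxml to be installed) instead of the 'lxml' HTML parser. With the XML parser the tags inside each item keep their contents: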

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = ''  # feed URL (left empty in the original answer)
data = urlopen(url).read()

# 'xml' selects the XML parser, so tag names keep their case (pubDate)
parsed = bs(data, 'xml')
items = parsed.find_all('item')

for k in items:
    # Tags inside each item are reachable as attributes of the item
    print("pubDate:", k.pubDate.text)
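
For comparison, the same idea (treating the feed as XML rather than HTML) can be sketched with only the standard library. This is not part of the original answer. The sample feed is made up for illustration and is not the SoundCloud feed from the question, and SAMPLE_FEED and read_items are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the real RSS response.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <item>
      <title>Example track</title>
      <link>http://example.com/track-1</link>
      <pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def read_items(feed_text):
    """Return a list of (link, pubDate) tuples, one per <item>."""
    root = ET.fromstring(feed_text)
    results = []
    # iter("item") walks every <item> element at any depth.
    for item in root.iter("item"):
        # findtext returns the child's text, or None if the tag is missing.
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        results.append((link, pub_date))
    return results


for link, pub_date in read_items(SAMPLE_FEED):
    print("link:", link)
    print("pubDate:", pub_date)
```

Because an XML parser has no notion of HTML's empty elements, the link text stays inside its tag here.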

Edit: Using lxml

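When I parse the <link>...</link> tag with BeautifulSoup and the lxml (HTML) parser, I get an invalid tag. In HTML, <link> is an empty (void) element, so every link is emitted as an empty <link/> tag with the URL left behind as loose text after it, and BeautifulSoup cannot return the tag's contents.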

An easy workaround is to pull the URL out with a regular expression. Here is an example:

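from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re

url = ''  # feed URL (left empty in the original answer)
data = urlopen(url).read()

soup = bs(data, 'lxml')
aa = soup.find_all('item')

for k in aa:
    # The URL is the text between the empty <link/> tag and the next whitespace
    link = re.findall(r'<link/>(.*?)\s+', str(k))
    # The HTML parser lowercases tag names, hence 'pubdate'
    pubdate = k.find('pubdate').string
    print("Link: {}\npubdate: {}".format(' '.join(link), pubdate))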
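
To see what the pattern is matching, here is a minimal sketch that runs it on a hard-coded string. SAMPLE_ITEM is only an assumption about what str(k) looks like once the HTML parser has emptied the link tag, and the real serialization may differ. extract_link is a hypothetical helper that uses a slightly stricter variant of the pattern, one that also stops at the next tag, so it still works when no whitespace follows the URL.

```python
import re

# Assumed shape of str(k): the <link> tag has been emptied to <link/>
# and the URL is left behind as plain text right after it.
SAMPLE_ITEM = (
    "<item><title>Example track</title>"
    "<link/>http://example.com/track-1\n"
    "<pubdate>Mon, 05 Jun 2017 00:00:00 +0000</pubdate></item>"
)

# Capture everything after <link/> up to the next whitespace or tag.
LINK_PATTERN = re.compile(r"<link/>([^\s<]+)")


def extract_link(item_html):
    """Return the URL that follows <link/>, or None if there is none."""
    match = LINK_PATTERN.search(item_html)
    return match.group(1) if match else None


print(extract_link(SAMPLE_ITEM))
```

Returning None for a missing link keeps the caller from having to join an empty list, which is what the findall version does.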

Both methods print each item's publication date (the second also prints the link):

pubDate: Tue, 21 Mar 2017 20:30:49 +0000
pubDate: Sun, 28 Jun 2015 00:00:00 +0000