Atma Atma -4 years ago 89
Python Question

Beautiful Soup returns close tag instead of tag text

I have the following RSS feed (SoundCloud), http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss:

<item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com</link>
</item>


I attempt to get the link tag contents with the following:

soup = BeautifulSoup(response, "lxml")

items = soup.findAll("item")
for i in items:
    print(i)
    created_at = i.find('pubdate')
    created_at = created_at.contents[0][:16]

    url = i.find('link')
    print(url)

This prints:

<link/>


If I try
url = i.find('link').string
or
url = i.find('link').content


I get


None


When I print the item "i", the link shows up as an empty tag, with the URL after it rather than inside it:

<link/>https://soundcloud.com/daptone-records/sharon-jones-the-dap-kings-white-christmas
00:02:23
Daptone Records
no
Sharon Jones & the Dap-Kings' first ever holiday album is out now!


How can I get the link to open properly?

Answer Source

Parse the feed with the 'xml' parser instead of 'lxml'. The tags inside each item can then be read directly:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()

# The 'xml' parser keeps tag names case-sensitive and does not treat <link> as empty
parsed = bs(data, 'xml')
items = parsed.find_all('item')

for k in items:
    # Tags inside each item are available as attributes
    print("Link:", k.link.text)
    print("pubDate:", k.pubDate.text)
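For comparison, here is a minimal sketch of the same lookup using only the standard library's xml.etree.ElementTree, which may help if installing an XML parser for Beautiful Soup is not an option. The inline sample_rss string and the extract_items helper are hypothetical names introduced for illustration; the sample is modeled on the snippet in the question, not copied from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample, modeled on the <item> snippet in the question.
sample_rss = """<rss version="2.0">
  <channel>
    <item>
      <pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
      <link>https://example.com</link>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return a list of (link, pubDate) tuples, one per <item> element."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        # findtext returns the child's text, or None if the child is missing
        results.append((item.findtext("link"), item.findtext("pubDate")))
    return results


for link, pub_date in extract_items(sample_rss):
    print("Link:", link)
    print("pubDate:", pub_date)
```

Because ElementTree is an XML parser, it keeps the URL inside the link element, just as the 'xml' parser does above.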

Edit: Using lxml

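When I parse the feed with BeautifulSoup and lxml, the <link>...</link> element comes out wrong. With 'lxml' the document is parsed as HTML, and in HTML <link> is a void element that cannot have content. Each link is therefore emitted as an empty <link/> tag with the URL left as loose text after it, so BeautifulSoup cannot reach the URL through the tag.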

So an easy hack is to use a regex. Here is an example:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re

url = 'http://feeds.soundcloud.com/users/soundcloud:users:7393028/sounds.rss'
data = urlopen(url).read()

soup = bs(data, 'lxml')
items = soup.find_all('item')

for k in items:
    # The URL is the text that follows the empty <link/> tag, up to the next whitespace
    link = re.findall(r'<link/>(.*?)\s+', str(k))
    # The HTML parser lowercases tag names, so pubDate becomes pubdate
    pubdate = k.find('pubdate').string
    print("Link: {}\npubDate: {}".format(' '.join(link), pubdate))

Both the 'xml' parser version and the regex version will output:

Link: https://soundcloud.com/daptone-records/move-upstairs
pubDate: Tue, 21 Mar 2017 20:30:49 +0000
...
Link: https://soundcloud.com/daptone-records/the-frightnrs-id-rather-go-blind-1
pubDate: Sun, 28 Jun 2015 00:00:00 +0000
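A third option, sketched here under assumptions, avoids the regex while still using an HTML parser. Since the URL ends up as a text node right after the empty link tag, next_sibling can read it. The sample_item string and the link_after_empty_tag helper are hypothetical, and the sketch uses the built-in 'html.parser' so it runs without lxml. As far as I can tell it treats link as a void element in the same way, but that is worth checking against your own setup.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample, modeled on the <item> snippet in the question.
sample_item = """<item>
<pubDate>Mon, 05 Jun 2017 00:00:00 +0000</pubDate>
<link>https://example.com</link>
</item>"""


def link_after_empty_tag(item):
    """Return the text that follows the empty <link/> tag, or None."""
    link_tag = item.find("link")
    if link_tag is None or link_tag.next_sibling is None:
        return None
    # The sibling is a text node that may carry trailing whitespace
    return str(link_tag.next_sibling).strip()


soup = BeautifulSoup(sample_item, "html.parser")
for item in soup.find_all("item"):
    print("Link:", link_after_empty_tag(item))
    # HTML parsers lowercase tag names, hence 'pubdate'
    print("pubDate:", item.find("pubdate").string)
```

This keeps the extraction tied to the parse tree instead of the serialized markup, so it does not depend on how the item is printed.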