My script below...
I feel like I'm missing one line of code to make this work properly. I'm using Reddit as a test source to scrape sports links.
# import libraries
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.reddit.com/r/BoxingStreams/comments/6w2vdu/mayweather_vs_mcgregor_archive_footage/'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
# html parsing
page_soup = soup(page_html, "html.parser")
hyperli = page_soup.findAll("form")
filename = "sportstreams.csv"
f = open(filename, "w")
headers = "Sport Links"
for containli in hyperli:
    link = containli.a["href"]
As explained in the "Navigating using tag names" section of the documentation:
Using a tag name as an attribute will give you only the first tag by that name
If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all()
In your case, you could use page_soup.select("form a[href]") to find all the links in forms that have an href attribute:
links = page_soup.select("form a[href]")
for link in links:
    href = link["href"]
    print(href)
    f.write(href + "\n")
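To see the difference without hitting Reddit, here is a minimal sketch that runs against a made-up HTML snippet. The markup and the example.com URLs are assumptions for illustration only, not the structure of the real page. It compares attribute access (form.a) with the CSS selector approach:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (an assumption for illustration).
html = """
<form>
  <a href="http://example.com/stream1">one</a>
  <a href="http://example.com/stream2">two</a>
  <a name="no-href">anchor without href</a>
</form>
<form>
  <a href="http://example.com/stream3">three</a>
</form>
"""

page_soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first <a> inside each form.
first_only = [form.a["href"] for form in page_soup.find_all("form")]

# The CSS selector returns every <a> that has an href, inside any form.
all_links = [a["href"] for a in page_soup.select("form a[href]")]

print(first_only)
print(all_links)
```

The first list misses stream2 because form.a stops at the first match in each form, which is likely why the original loop seemed to skip links. The [href] part of the selector also skips anchors that have no href, so link["href"] cannot raise a KeyError.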