I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.
from bs4 import BeautifulSoup
import requests

url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(content, "html.parser")

# Now I navigate the soup
for a in soup.findAll('a'):
    print(a.get('href'))
With BeautifulSoup, that's all doable and simple.
(1) Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.
Say all the links you need have price in the text; then you can use:

soup.find_all("a", text="price")  # text equals 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text)  # 'price' is inside the text

You can also pass a compiled regular expression:

import re

soup.find_all("a", text=re.compile(r"^[pP]rice"))  # text starts with 'price' or 'Price'
If, instead, price is somewhere in the href attribute, you can use the following CSS selectors:

soup.select("a[href*=price]")  # href contains 'price'
soup.select("a[href^=price]")  # href starts with 'price'
soup.select("a[href$=price]")  # href ends with 'price'

Or, the find_all() equivalent of the "contains" check:

soup.find_all("a", href=lambda href: href and "price" in href)
(2) The href links I want are all in a certain location within the webpage, within the page's body and a certain div. Can I access only these links?
Sure, locate the appropriate container and call find_all() or other searching methods on it:

container = soup.find("div", class_="container")

for link in container.select("a[href*=price]"):
    print(link["href"])
Or, you may write your CSS selector so that it searches for links inside a specific element having the desired attribute or attribute values. For example, here we are searching for a elements having href attributes, located inside a div element having the container class:

soup.select("div.container a[href]")
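Here is a minimal sketch of both approaches side by side; the page markup, the container class, and the URLs are made up for the demo:

from bs4 import BeautifulSoup

html = """
<div class="container">
    <a href="/item/price-1">first</a>
    <a href="/item/price-2">second</a>
</div>
<div class="footer">
    <a href="/legal/price-policy">ignored</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Approach 1: find the container first, then search inside it
container = soup.find("div", class_="container")
print([a["href"] for a in container.select("a[href*=price]")])

# Approach 2: a single CSS selector scoped to the container
print([a["href"] for a in soup.select("div.container a[href*=price]")])

Both print ['/item/price-1', '/item/price-2'], skipping the footer link.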
(3) How can I scrape the contents within each href link and save them into some file format?
If I understand correctly, you need to get the appropriate links, follow them, and save the source code of the pages locally into HTML files. There are multiple options to choose from depending on your requirements (for instance, speed may be critical; or it may be a one-time task where you don't care about performance).
If you stay with requests, the code will be blocking in nature: you'll extract a link, follow it, save the page source, and only then proceed to the next one. The main downside is that it would be slow (depending on, for starters, how many links there are). Sample code to get you going:
from urllib.parse import urljoin  # Python 3; on Python 2 use "from urlparse import urljoin"

from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'

with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")

    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)

        # response.content is bytes, so open the file in binary mode
        with open(title + ".html", "wb") as f:
            f.write(session.get(full_link).content)
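If speed does matter, one simple option is to parallelize the downloads with the standard library's ThreadPoolExecutor. This is a minimal sketch under the same assumptions as above (same base_url and selector); note that sharing a requests.Session across threads is a common shortcut, though it is not officially documented as thread-safe:

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
session = requests.Session()

def save_page(link):
    """Follow a single link and save the page source into an HTML file."""
    full_link = urljoin(base_url, link["href"])
    title = link.get_text(strip=True)
    with open(title + ".html", "wb") as f:
        f.write(session.get(full_link).content)

soup = BeautifulSoup(session.get(base_url).content, "html.parser")
links = soup.select("div.container a[href]")

with ThreadPoolExecutor(max_workers=5) as executor:
    # list() forces iteration so that any exception raised in a worker surfaces
    list(executor.map(save_page, links))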