ShanZhengYang - 5 months ago
HTML Question

Navigation with BeautifulSoup

I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.

import requests
from bs4 import BeautifulSoup

url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(content, "html.parser")

# Now I navigate the soup
for a in soup.find_all('a'):
    print(a.get("href"))



  1. Is there a way to find only particular href by the labels? For example, all the href's I want are called by a certain name, e.g. price in an online catalog.

  2. The href links I want are all in a certain location within the webpage, within the page's … and a certain …. Can I access only these links?

  3. How can I scrape the contents within each href link and save into a file format?


Answer

With BeautifulSoup, that's all doable and simple.

(1) Is there a way to find only particular href by the labels? For example, all the href's I want are called by a certain name, e.g. price in an online catalog.

Say all the links you need have price in the text; then you can use the text argument:

soup.find_all("a", text="price")  # text equals 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text)  # 'price' is inside the text
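As a quick, self-contained illustration (the sample HTML below is made up for the demo), both filters can be tried without hitting a real site:

```python
from bs4 import BeautifulSoup

html = """
<a href="/catalog/1">price: $10</a>
<a href="/catalog/2">details</a>
<a href="/catalog/3">best price</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact text match vs. substring match via a function filter
exact = soup.find_all("a", text="price: $10")
contains = soup.find_all("a", text=lambda t: t and "price" in t)

print([a["href"] for a in exact])     # ['/catalog/1']
print([a["href"] for a in contains])  # ['/catalog/1', '/catalog/3']
```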

Yes, you may use functions and many other kinds of objects to filter elements - for example, compiled regular expressions:

import re

soup.find_all("a", text=re.compile(r"^[pP]rice"))

If price is somewhere in the "href" attribute itself, you can use the following CSS selectors:

soup.select("a[href*=price]")  # href contains 'price'
soup.select("a[href^=price]")  # href starts with 'price'
soup.select("a[href$=price]")  # href ends with 'price'

or, via find_all():

soup.find_all("a", href=lambda href: href and "price" in href)
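To see the attribute selectors in action (again on made-up sample HTML), a minimal sketch:

```python
from bs4 import BeautifulSoup

html = """
<a href="price-list.html">list</a>
<a href="/shop/price">shop</a>
<a href="/about">about</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Substring, prefix and suffix matches on the href attribute
contains = [a["href"] for a in soup.select("a[href*=price]")]
starts = [a["href"] for a in soup.select("a[href^=price]")]
ends = [a["href"] for a in soup.select("a[href$=price]")]

print(contains)  # ['price-list.html', '/shop/price']
print(starts)    # ['price-list.html']
print(ends)      # ['/shop/price']
```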

(2) The href links I want are all in a certain location within the webpage, within the page's … and a certain …. Can I access only these links?

Sure, locate the appropriate container and call find_all() or other searching methods:

container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])

Or, you can write a CSS selector that searches for links inside a specific element having the desired attribute or attribute values. For example, here we search for a elements having an href attribute, located inside a div element with the container class:

soup.select("div.container a[href]")
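A minimal sketch of this scoping, using a made-up page where only one link sits inside the container:

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <a href="/price/1">inside</a>
</div>
<a href="/price/2">outside</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only links that are descendants of div.container are matched
inside = [a["href"] for a in soup.select("div.container a[href]")]
print(inside)  # ['/price/1']
```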

(3) How can I scrape the contents within each href link and save into a file format?

If I understand correctly, you need to get the appropriate links, follow them and save the source code of the pages locally into HTML files. There are multiple options to choose from, depending on your requirements (for instance, speed may be critical, or it may be a one-time task where you don't care about performance).

If you stay with requests, the code would be of a blocking nature - you extract a link, follow it, save the page source and then proceed to the next one. The main downside is that it would be slow (depending, for starters, on how many links there are). Sample code to get you going:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")

    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)  # use the link text as a file name

        with open(title + ".html", "wb") as f:  # write the raw bytes as received
            f.write(session.get(full_link).content)

You may look into grequests or Scrapy to speed up that part.
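Another option, sketched here under the assumption that a thread pool from the standard-library concurrent.futures module is acceptable, is to parallelize the downloads yourself. The fetch callable below is an offline stub so the snippet runs as-is; in real use it would be something like lambda u: requests.get(u).text:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=5):
    # Run `fetch` over all URLs concurrently, preserving input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# Offline stub instead of a real HTTP call, so the example is runnable as-is.
pages = fetch_all(
    ["http://examplewebsite.com/a", "http://examplewebsite.com/b"],
    fetch=lambda url: "<html>stub for %s</html>" % url,
)
print(len(pages))  # 2
```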