For an extra curricular school project, I'm learning how to scrape a website. As you can see by the code below, I am able to scrape a form called, 'elqFormRow' off of one page.
How would one go about scraping all occurrences of the 'elqFormRow' on the whole website? I'd like to return the URL of where that form was located into a list, but am running into trouble while doing so because I don't know how lol.
import bs4 as bs
sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs import requests domain = "engage.hpe.com" initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage' # get urls to scrape text = requests.get(initial_url).text initial_soup = bs.BeautifulSoup(text, 'lxml') tags = initial_soup.findAll('a', href=True) urls =  for tag in tags: if domain in tag: urls.append(tag['href']) urls.append(initial_url) print(urls) # function to grab your info def scrape_desired_info(url): out =  text = requests.get(url) soup = bs.BeautifulSoup(text, 'lxml') for div in soup.find_all('div', class_='elqFormRow'): out.append(div.text.strip()) return out info = [scrape_desired_info(url) for url in urls if domain in url]
URLlib stinks, use requests. If you need to go multiple levels down in the site put the URL finding section in a function and call it X number of times, where X is the number of levels of links you want to traverse.
Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.