Daniel Daniel - 1 month ago 10
Python Question

BeautifulSoup Doesn't Pull all Elements

I am trying to scrape information from http://www.emoryhealthcare.org/locations/offices/advanced-digestive-care-1.html .

I would like to scrape the Specialties that appear in the lower third of the page, namely "Gastroenterology" and "Internal Medicine". When I inspect the element, I see that it is an li of <div class="module bordered specialist">, yet when I loop through the soup and print each item found, I get different results than expected:

<div class="module bordered specialist">
<ul>
<li>Cardiac Care</li>
<li>Transplantation</li>
<li>Cancer Care (Oncology)</li>
<li>Diagnostic Radiology</li>
<li>Neurosciences</li>
<li>Mental Health Services</li>
</ul>
</div>


When I open the website in a browser, I see the above values flash prior to the contents switching to the expected results. Is there a way for me to improve the likelihood that I am able to scrape the items that I intend to?

Answer

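The list appears to be filled in after the initial page load, which is why the raw HTML only contains the placeholder values you see flash by. Use Selenium to load the page in a real browser, wait a few seconds for the content to update, then parse the page source as you were doing before. That did the trick for me.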

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from bs4 import BeautifulSoup

# Path to the local chromedriver binary; adjust for your machine.
chromedriver = "/Users/Rafael/chromedriver"
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(chromedriver))
try:
    driver.get('http://www.emoryhealthcare.org/locations/offices/advanced-digestive-care-1.html')
    # Give the page time to replace the placeholder list.
    time.sleep(5)
    html = driver.page_source
finally:
    # Close the browser even if loading fails.
    driver.quit()

soup = BeautifulSoup(html, 'lxml')
results = soup.find_all("div", { "class" : "module bordered specialist" })
print(results[0].text) #prints GastroenterologyInternal Medicine
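
Two things could be tightened here: the fixed five-second sleep is a guess, and the final print runs the two specialties together. The sketch below is one possible way to handle both, not a tested solution for this particular site. It polls the page source until the list no longer matches the placeholder entries quoted in the question, then returns each li as its own string. The helper names extract_specialties and wait_for_specialties are made up for illustration, and the assumption that the rendered markup keeps the same div/ul/li structure as the placeholder markup is mine. Selenium also ships its own WebDriverWait for explicit waits, which may be a better fit in a real script.

```python
import time
from bs4 import BeautifulSoup

# Placeholder entries reported in the question (shown before the page updates).
PLACEHOLDERS = [
    "Cardiac Care", "Transplantation", "Cancer Care (Oncology)",
    "Diagnostic Radiology", "Neurosciences", "Mental Health Services",
]


def extract_specialties(html):
    """Return the text of each <li> inside the specialist module as a list."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the rendered list lives in the same div/ul/li structure.
    items = soup.select("div.module.bordered.specialist li")
    return [li.get_text(strip=True) for li in items]


def wait_for_specialties(get_html, timeout=10.0, poll=0.5):
    """Poll get_html() until the list is non-empty and not the placeholder list.

    get_html is any zero-argument callable returning the current page HTML,
    e.g. lambda: driver.page_source when used with Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        specialties = extract_specialties(get_html())
        if specialties and specialties != PLACEHOLDERS:
            return specialties
        if time.monotonic() >= deadline:
            raise TimeoutError("specialties list still shows placeholder content")
        time.sleep(poll)
```

With a live Selenium driver, this would be called as wait_for_specialties(lambda: driver.page_source) before driver.quit(). Comparing against the full placeholder list, rather than checking for any single placeholder name, avoids rejecting an office that genuinely lists one of those specialties.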