I have a homework project involving web scraping, and I am supposed to collect all of the event information for a month from a school website. I am using Python with Requests and Beautiful Soup. I have written some code to fetch a URL, and I am trying to grab all of the li elements on the page that hold the event information. However, when I try to grab all of the li content, I notice that I am not receiving all of them. I had been thinking it is due to the "overflow:hidden" style on the ul, but then why am I able to get the first few li's?
import requests
from bs4 import BeautifulSoup

url = 'https://apps.iu.edu/ccl-prd/events/view?date=06012016&type=day&pubCalId=GRP1322'
r = requests.get(url)
bsObj = BeautifulSoup(r.text, "html.parser")
eventURLs = bsObj.find_all("a", href=True)
count = 1
for url in eventURLs:
    print(str(count) + '. ' + url['href'])
    count += 1
I looked at the source of the page, and in the plain HTML there are 25 <a> elements that have an href attribute. Those are the 25 links that your script is finding.
Also, I'm not sure which events on that page are the ones you're actually looking for, but I'd guess that many (if not all) of the URLs that were printed out are not actually the events you want (more on this later).
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. The key point is that it only parses the HTML you hand it; it doesn't execute JavaScript. The rest of the events on that page are most likely being loaded by JavaScript after the initial page load, which is why requests never sees them. A tool like Selenium drives a real browser, so the JavaScript actually runs; once the page has rendered, you can pull the live DOM out of the browser (driver.page_source will only get you what
requests gave you):
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
There are also headless browsers ("headless" meaning that it doesn't have a GUI, so you'd never see it and it wouldn't need a display) that you could use if you'd prefer that, or if your script needs to run on something without a display (I know that Firefox simply won't launch if you don't have a display connected). I'd imagine that there's a way to use BeautifulSoup with these browsers too, if you really want to.
If you decide to go the route where you look at where this page is pulling its event data from, you might be able to get away with just using requests. If that endpoint returns JSON, requests has a response.json() method that will turn the whole thing into a Python dict, and you can just search through that.
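As a minimal sketch of that approach (the payload shape here is made up for illustration; the real endpoint's structure will be different, so inspect it in your browser's network tab first), walking the dict that response.json() would give you might look like this:

```python
import json

# Stand-in for what response.json() would return from the events endpoint;
# this structure is invented for the sketch.
payload = json.loads("""
{
  "events": [
    {"title": "Orientation", "url": "/events/1"},
    {"title": "Guest Lecture", "url": "/events/2"}
  ]
}
""")

# Once it's a plain dict, pulling out the event links is just normal Python.
event_urls = [event["url"] for event in payload["events"]]
print(event_urls)
```

No HTML parsing at all, which is why this route is usually the least fragile when it's available.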
If you are using an HTML parser, though (e.g. BeautifulSoup, Selenium), you should definitely narrow down where you're searching for these links: find the element on the page that contains all of the event <a> elements, and then call .find_all("a", href=True) (for BeautifulSoup) or .find_elements_by_css_selector("a[href]") (for Selenium) on that element object (yes, you can do that, which is awesome!).
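To illustrate scoping the search in BeautifulSoup (the HTML and the "event-list" class name here are invented; use your browser's inspector to find the real container on the events page):

```python
from bs4 import BeautifulSoup

# Toy page: some nav links, plus a container that actually holds the
# event links. The "event-list" class is made up for this sketch.
html = """
<nav><a href="/home">Home</a><a href="/about">About</a></nav>
<ul class="event-list">
  <li><a href="/events/1">Orientation</a></li>
  <li><a href="/events/2">Guest Lecture</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the container first, then search only inside it --
# .find_all() works on any Tag object, not just the whole document.
container = soup.find("ul", class_="event-list")
event_links = [a["href"] for a in container.find_all("a", href=True)]
print(event_links)  # the nav links are excluded
```

Scoping like this is what keeps the site's navigation and footer links out of your results.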
I'm not sure of the exact criteria of your assignment, so I have no idea whether any of these options conflict with them, but I hope I've at least pointed you in the right direction.