LatentFreedom - 1 year ago
HTML Question

Why are li's not showing up with python requests response?

I have a homework project on web scraping and am supposed to collect all the event information for a month from a school website. I am using Python with Requests and Beautiful Soup. I have written some code to fetch a URL and am trying to grab all of the li's on the page that hold the event information. However, when I grab the li content I am not receiving all of them. I suspected the "overflow:hidden" style on the ul, but then why am I able to get the first few li's?

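from bs4 import BeautifulSoup
import requests

url = ''  # page URL left blank in the original post
r = requests.get(url)
bsObj = BeautifulSoup(r.text, "html.parser")

eventList = []
# Every <a> tag in the static HTML that has an href attribute
eventURLs = bsObj.find_all("a", href=True)
print(len(eventURLs))

# Number the links starting at 1; "link" avoids shadowing the page url above
for count, link in enumerate(eventURLs, start=1):
    print(str(count) + '. ' + link['href'])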

I am printing out the urls because I plan on going to the href link inside of the events to get the full descriptions and other metadata provided. However, I am not getting all of the event lis. I am only getting the first 5. The links in the output that I get that are for the events are numbers 19 to 23. The page has 10 total events though.


2. #advancedSearch
3. /ccl-prd/events/view?type=week&date=06012016&pubCalId=GRP1322
4. /ccl-prd/events/view?type=month&date=06012016&pubCalId=GRP1322
5. /ccl-prd/events/view?type=day&date=06222016&pubCalId=GRP1322
6. /ccl-prd/events/view?pubCalId=GRP1432&type=day&date=06012016
7. /ccl-prd/events/view?pubCalId=GRP1445&type=day&date=06012016
8. /ccl-prd/events/view?pubCalId=GRP1436&type=day&date=06012016
9. /ccl-prd/events/view?pubCalId=GRP1438&type=day&date=06012016
10. /ccl-prd/events/view?pubCalId=GRP1440&type=day&date=06012016
11. /ccl-prd/events/view?pubCalId=GRP1443&type=day&date=06012016
12. /ccl-prd/events/view?pubCalId=GRP1434&type=day&date=06012016
13. /ccl-prd/events/view?pubCalId=GRP1447&type=day&date=06012016
14. /ccl-prd/events/view?pubCalId=GRP1450&type=day&date=06012016
17. /ccl-prd/events/view?type=day&date=06012016&iub=BL011&pubCalId=GRP1322
18. /ccl-prd/events/view?type=day&date=06012016&iub=BL153&pubCalId=GRP1322
19. /ccl-prd/events/view/13147231?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
20. /ccl-prd/events/view/13163329?viewParams=%26type%3dday%26date%3d06012016&referrer=listView&pubCalId=GRP1322
21. /ccl-prd/events/view/13163465?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
22. /ccl-prd/events/view/13110443?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322
23. /ccl-prd/events/view/11744967?viewParams=%26type%3dday%26date%3d06012016&theDate=06222016&referrer=listView&pubCalId=GRP1322

TLDR: I am not getting all the links from the lis on a page when I use Python Requests and Beautiful Soup. Why am I not getting the links, and is there a better way of going about this problem?

Edited to give answer: The links I needed were all being created with JavaScript, and since Requests and Beautiful Soup do not run JavaScript, I have moved to Selenium with PhantomJS instead.

Answer Source

I looked at the source of the page, and in the plain HTML, there are 25 <a> elements that have an href attribute. These are the 25 links that your script is finding.

Also, I'm not sure which events on that page are the ones that you're actually looking for, but I'm gonna guess that many (if not all) of those urls that were printed out are not actually the events that you're looking for (more on this later).

The reason you're not finding the other links that you see when you open the page in your browser is that they are generated using JavaScript. BeautifulSoup only looks at the plain HTML and doesn't run any JavaScript, as it's just a tool for analyzing and modifying static HTML or XML files. From their documentation:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
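To make that concrete, here is a small sketch using only the standard library's HTMLParser, which is the parser the question hands to BeautifulSoup via "html.parser". The page markup and the /events/view/... paths are made up for illustration. The second link only exists once a browser runs the script, so the parser never sees it:

```python
from html.parser import HTMLParser

# Hypothetical page: one link is in the static HTML, a second one would
# only exist after a browser executed the <script> block.
PAGE = """
<ul id="events">
  <li><a href="/events/view/1">Static event</a></li>
</ul>
<script>
  var a = document.createElement("a");
  a.href = "/events/view/2";
  a.textContent = "Dynamic event";
  var li = document.createElement("li");
  li.appendChild(a);
  document.getElementById("events").appendChild(li);
</script>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.hrefs.append(href)


collector = LinkCollector()
collector.feed(PAGE)
static_links = collector.hrefs
print(static_links)  # ['/events/view/1']
```

The script is treated as plain text by the parser, so only the link that is literally present in the markup shows up.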

You need to either utilize something with a JavaScript engine to actually generate those elements, or find out where this page is pulling its event list from, and go there for your data.

You can try using a real browser with something like Selenium, which even lets you search through the DOM similarly to BeautifulSoup, so you wouldn't need BeautifulSoup as well. If you're dead set on using BeautifulSoup, though, you can use Selenium to control a browser so it generates the elements using JavaScript (since that's what browsers do automatically), and then have Selenium give you the resulting HTML by calling something like this (driver.page_source may also work, but Selenium's documentation does not guarantee that it reflects changes JavaScript made after the page loaded):

html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
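Wrapped into a helper, that call might look like the sketch below. The function name rendered_html, the FakeDriver class and the example.com URL are all made up for illustration. FakeDriver only mimics the two WebDriver methods used (get and execute_script) so the snippet runs without a browser installed:

```python
def rendered_html(driver, url):
    """Load url in the browser behind driver and return the DOM after scripts ran."""
    driver.get(url)
    # Serialize the live DOM rather than the original response body.
    return driver.execute_script(
        "return document.getElementsByTagName('html')[0].innerHTML"
    )


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver, for demonstration only."""

    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)

    def execute_script(self, script):
        # A real browser would return the rendered page here.
        return "<head></head><body><a href='/events/view/1'>Event</a></body>"


driver = FakeDriver()
html = rendered_html(driver, "http://example.com/calendar")
print(html)
```

With Selenium installed you would pass a real driver instead, for example one created with webdriver.Firefox(), and call driver.quit() when you are done. The returned string can then go straight into BeautifulSoup(html, "html.parser"). If the events load slowly, you may need to wait for them to appear before grabbing the HTML.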

There are also headless browsers ("headless" meaning that it doesn't have a GUI, so you'd never see it and it wouldn't need a display) that you could use if you'd prefer that, or if your script needs to run on something without a display (as far as I know, Firefox won't launch without a display unless it is started in its headless mode). I'd imagine that there's a way to use BeautifulSoup with these browsers too, if you really want to.

If you decide to go the route of finding where this page pulls its event data from, you might be able to get away with just using requests: if the JavaScript is simply fetching some JSON, requests has a response.json() method that parses the whole thing into Python objects (a dict or a list, depending on the JSON), and you can just search through that.
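As a rough sketch of that route: I don't know what this site's data actually looks like, so the payload, its "events" key and the field names below are invented. json.loads stands in for response.json(), which gives you roughly what json.loads(response.text) would:

```python
import json

# Hypothetical response body; the real endpoint and its fields are unknown.
body = """
{
  "events": [
    {"id": 1, "title": "First event", "url": "/events/view/1"},
    {"id": 2, "title": "Second event", "url": "/events/view/2"}
  ]
}
"""

# With requests this would be: data = requests.get(endpoint_url).json()
data = json.loads(body)

# Once parsed, it is ordinary dict and list access.
event_urls = [event["url"] for event in data["events"]]
print(event_urls)  # ['/events/view/1', '/events/view/2']
```

You can usually find the real endpoint by watching the Network tab of your browser's developer tools while the page loads.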

If you are searching the page's elements, though (e.g. with BeautifulSoup or Selenium), you should definitely try to narrow down where you're searching for these links by finding the element on the page that contains all these <a> elements, and then calling .find_all("a", href=True) (for BeautifulSoup) or .find_elements(By.CSS_SELECTOR, "a[href]") (for Selenium 4; older releases spelled it .find_elements_by_css_selector("a[href]")) on that element object (yes, you can do that, which is awesome!).
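Here is a minimal BeautifulSoup sketch of that narrowing. The markup and the "event-list" id are invented, so check the real page for the element that actually wraps the events:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: navigation links plus a list that holds the events.
PAGE = """
<div id="nav">
  <a href="#advancedSearch">Search</a>
  <a href="/events/view?type=week">Week</a>
</div>
<ul id="event-list">
  <li><a href="/events/view/101">First event</a></li>
  <li><a href="/events/view/102">Second event</a></li>
</ul>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# Searching the whole document also picks up the navigation links.
all_links = soup.find_all("a", href=True)

# Searching only inside the container leaves just the event links.
container = soup.find("ul", id="event-list")
event_links = [a["href"] for a in container.find_all("a", href=True)]

print(len(all_links))  # 4
print(event_links)     # ['/events/view/101', '/events/view/102']
```

This also means you don't have to filter out things like #advancedSearch by hand afterwards.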

I'm not sure of the exact criteria of your assignment, so I have no idea if any of these options conflict with them. But I hope I at least pointed you in the right direction.
