Guru Guru - 6 months ago 45
HTML Question

Unable to scrape simple HTML content using BeautifulSoup


I am trying to get the list of companies from AngelList: https://angel.co/companies

I tried this code:

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://angel.co/companies', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div' , {"class"," dc59 frw44 _a _jm"})
print p1


But this returns an empty list.

I have gone through similar questions; some say to update BeautifulSoup, others say to change the parser. Nothing is working for me.

Answer

The data you want to extract is generated by JavaScript. That is why p1 is an empty list: urllib2.urlopen(req).read() returns only the raw HTML the server sends. It does not execute JavaScript, so the elements the scripts would create never appear in the response.
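As a rough illustration of the point above, here is a minimal sketch that parses a made-up server response. The markup in raw_html is purely hypothetical (it is not what angel.co actually serves); it only assumes that the target div is created by a script rather than being present in the HTML itself.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a server response: the inner <div> would only
# exist after a browser executes the script.
raw_html = """
<html><body>
<div id="root"></div>
<script>
  var d = document.createElement('div');
  d.className = 'dc59 frw44';
  document.getElementById('root').appendChild(d);
</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# BeautifulSoup parses markup only; it never runs the script.
static_matches = soup.find_all("div", class_="dc59")
print(static_matches)
```

The static container is found, but the scripted div is not, which is the same situation the question describes.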

Use BeautifulSoup in combination with Selenium, which drives a real browser that runs the page's scripts:

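from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real browser, so the page's JavaScript actually runs.
browser = webdriver.Firefox()
browser.get('https://angel.co/companies')
html = browser.page_source  # HTML of the DOM as the browser currently sees it
browser.quit()

soup = BeautifulSoup(html, "html.parser")
# attrs is a dict mapping attribute name to value; a multi-class string
# must match the class attribute exactly (no leading space).
p1 = soup.find_all('div', {"class": "dc59 frw44 _a _jm"})
print(p1)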

If this doesn't work (I haven't tested it), simplify the class selector: search for dc59 only at first, then make it gradually more specific.
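The broad-then-narrow approach could look something like the sketch below. The rendered_html sample is invented for illustration and the real page's structure may well differ; only the class names come from the question. It relies on BeautifulSoup's documented behaviour that class_ with a single name matches any element carrying that class, while a CSS selector passed to select() requires all listed classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup; the real page's structure may differ.
rendered_html = """
<div class="dc59 frw44 _a _jm">Company A</div>
<div class="frw44 dc59">Company B</div>
<div class="other">Not a company</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# Start broad: any <div> whose class list contains dc59.
broad = soup.find_all("div", class_="dc59")

# Then narrow: the CSS selector requires every listed class, in any order.
narrow = soup.select("div.dc59.frw44._a._jm")

broad_names = [d.get_text() for d in broad]
narrow_names = [d.get_text() for d in narrow]
print(broad_names)
print(narrow_names)
```

If the broad search already returns nothing, the problem is probably not the selector but the HTML being parsed.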