Guru Guru - 6 months ago 9
HTML Question

unable to scrape

I am trying to get the list of companies from AngelList: https://angel.co/companies

I tried this code:

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://angel.co/companies', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div' , {"class"," dc59 frw44 _a _jm"})
print p1


But this returns an empty list.

I have gone through similar questions; some say to update BeautifulSoup, others say to change the parser. Nothing works for me.

Answer

You can get all of the company HTML without Selenium. First POST to https://angel.co/company_filters/search_data to get the request parameters (ids, total, page and hexdigest), then pass those to the startups endpoint:

import requests
from bs4 import BeautifulSoup



js = "https://angel.co/company_filters/search_data"

headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}




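# the placeholders are filled in order: ids, total, page, hexdigest
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"
with requests.Session() as s:
    # the POST returns the ids, total, page and hexdigest needed for the GET
    params = s.post(js, data={"sort": "signal"}, headers=headers).json()
    companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]), headers=headers)
    # name the parser explicitly so bs4 does not have to guess
    soup = BeautifulSoup(companies.json()["html"], "html.parser")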
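
As an alternative to joining the ids by hand, the query string could be built with urlencode from the standard library, which repeats the ids[] key and percent-encodes the brackets. This is only a sketch for Python 3: build_startups_url is a hypothetical helper, and the values in sample are made up to stand in for the JSON returned by the POST.

```python
from urllib.parse import urlencode


def build_startups_url(params, base="https://angel.co/companies/startups"):
    """Build the GET url from the dict returned by the search_data POST."""
    # one ("ids[]", id) pair per company; urlencode turns "ids[]" into ids%5B%5D
    query = [("ids[]", i) for i in params["ids"]]
    query += [
        ("total", params["total"]),
        ("page", params["page"]),
        ("sort", "signal"),
        ("new", "false"),
        ("hexdigest", params["hexdigest"]),
    ]
    return base + "?" + urlencode(query)


# made-up values, only to show the shape of the result
sample = {"ids": [275696, 275832], "total": 2, "page": 1, "hexdigest": "abc123"}
url = build_startups_url(sample)
print(url)
```

Because each field is named next to its value, there is no positional order to get wrong as there is with str.format.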

You can pass the page number as you iterate to simulate clicking load more:

import requests
from bs4 import BeautifulSoup
import time

# post url
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


# get url
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"


def get_next_pages(js, u, start_page=1):
    with requests.Session() as s:
        page = start_page
        while True:
            params = s.post(js, data={"sort": "signal", "page": page}, headers=headers).json()
            # keep going until we have reached the maximum queries
            if "ids" not in params:
                break
            # format args must match the template order: ids, total, page, hexdigest
            companies = s.get(
                u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"],
                         params["hexdigest"]),
                headers=headers)
            soup = BeautifulSoup(companies.json()["html"], "html.parser")
            yield soup.select("div.company.column")
            # increment page count from previous.
            page = params["page"] + 1
            # don't hammer with requests
            time.sleep(.3)

for comps in get_next_pages(js, u):
    print(comps)
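
If you only want the first few pages, one option is to cap the generator with itertools.islice and flatten the batches into a single list. A minimal sketch: collect is a hypothetical helper, and fake_pages is a stand-in for get_next_pages(js, u) so the example runs without touching the network.

```python
from itertools import chain, islice


def collect(pages, max_pages=3):
    """Flatten at most max_pages batches from a generator into one list."""
    # islice stops pulling from the generator after max_pages batches,
    # so no further requests would be made once the cap is reached
    return list(chain.from_iterable(islice(pages, max_pages)))


def fake_pages():
    # stand-in for get_next_pages(js, u): yields one list per page, forever
    page = 0
    while True:
        page += 1
        yield ["company-{}-{}".format(page, n) for n in range(2)]


result = collect(fake_pages(), max_pages=3)
print(len(result))
```

With the real generator the call would look like collect(get_next_pages(js, u), max_pages=3).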

If we look at the network output in developer tools, we can see the POST data sent each time we hit load more; it keeps going until we hit our limit.

A snippet of the output from running the code above:

[<div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies" title="Dunwello"><img alt="Dunwello" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275696-99335faecd2fb01467c98d5032f23cf6-thumb_jpg.jpg?buster=1393099676"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies" title="GroupAhead"><img alt="GroupAhead" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275832-3541a563250008bd3f7f9b4d7fe9c33c-thumb_jpg.jpg?buster=1423077576"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies">GroupAhead</a>
</div>
<div class="pitch">
Dedicated apps for groups
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies" title="Workpop"><img alt="Workpop" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/431492-c1b857e30254da60f3847d5358db5c82-thumb_jpg.jpg?buster=1404420060"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies">Workpop</a>
</div>
<div class="pitch">
When can you start?
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies" title="Late Stage Pre-IPO @ Flight.vc"><img alt="Late Stage Pre-IPO @ Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/446358-3511ab7edb5192dad97cbccf2b67ddd7-thumb_jpg.jpg?buster=1428089778"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies">Late Stage Pre-IPO @ Flight.vc</a>
</div>
<div class="pitch">
Syndicated:  Beepi, Zirx, Boost Media, Rent the Runway, Life 360, Scripted
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies" title="Complex Polygon"><img alt="Complex Polygon" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/450451-4f00fd11b2d54533a5bac3cfa72acb1e-thumb_jpg.jpg?buster=1407937645"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies">Complex Polygon</a>
</div>
<div class="pitch">
Product studio based in San Francisco, California. 
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies" title="21"><img alt="21" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/457068-2e7b8c417c3a70aab3026f5f0ca3d8e9-thumb_jpg.jpg?buster=1425975133"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies">21</a>
</div>
<div class="pitch">
A bitcoin miner in every device and in every hand.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies" title="Parenthoods"><img alt="Parenthoods" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/460720-25bc7ca7afd4f7bf0fd7842cafa1bdd1-thumb_jpg.jpg?buster=1425426951"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies">Parenthoods</a>
</div>
<div class="pitch">
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies" title="Seed"><img alt="Seed" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/462906-f6b439e20a9d36b9e2d3792da92d160d-thumb_jpg.jpg?buster=1462318689"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies">Seed</a>
</div>
<div class="pitch">
Online Business Banking
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies" title="Zen99"><img alt="Zen99" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/470102-67da791cec4374a1046c53fe99b6f05f-thumb_jpg.jpg?buster=1410560341"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies">Zen99</a>
</div>
<div class="pitch">
Finance and insurance tools for freelancers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies" title="Maven Ventures Growth Labs"><img alt="Maven Ventures Growth Labs" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/488240-d467860829cac8b1a9fbfa2d14e05789-thumb_jpg.jpg?buster=1411577330"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies">Maven Ventures Growth Labs</a>
</div>
<div class="pitch">
Get a option to invest up to $500k in the best Maven grads
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies" title="Skydio"><img alt="Skydio" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/507975-aac9786d6c4cba99be634b7bc1969cf3-thumb_jpg.jpg?buster=1420952326"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies">Skydio</a>
</div>
<div class="pitch">
MIT, Google[x]ers with deep prior experience doing intelligent navigation for drones
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies" title="Fin Tech by Flight.vc"><img alt="Fin Tech by Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/517240-5bc50eb42d1e40a8ad437c6bd164a5a8-thumb_jpg.jpg?buster=1414004533"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies">Fin Tech by Flight.vc</a>
</div>
<div class="pitch">
Investing in Financial Services and Fin-Tech that has proprietary advantages
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies" title="Channel"><img alt="Channel" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/521452-b6bc15ef040fdf37d885aea71ecad3bb-thumb_jpg.jpg?buster=1446676191"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies">Channel</a>
</div>
<div class="pitch">
Watch the world.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies" title="HealthSherpa"><img alt="HealthSherpa" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/443932-63c6bcbbf9ba36a7fa3e532177222c9b-thumb_jpg.jpg?buster=1462374897"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies">HealthSherpa</a>
</div>
<div class="pitch">
Next-generation Healthcare.gov
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies" title="Sidewire"><img alt="Sidewire" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/558206-b416bf8347c7f766b5ea1cf79123c4d2-thumb_jpg.jpg?buster=1444189112"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies">Sidewire</a>
</div>
<div class="pitch">
Where Experts Chat in Public
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies" title="Brainchild &amp;amp; Co."><img alt="Brainchild &amp; Co." class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/570055-cc2c2309fefa21e3ebda6229d6a0b890-thumb_jpg.jpg?buster=1420474118"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies">Brainchild &amp; Co.</a>
</div>
<div class="pitch">
Building services and products for consumers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies" title="Signatures Capital"><img alt="Signatures Capital" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/571060-8a077d7cbac9cc7e2d81859adb8cd1c6-thumb_jpg.jpg?buster=1420664121"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies">Signatures Capital</a>
</div>
<div class="pitch">
Supporting founders committed to inventing the future.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies" title="Airtable"><img alt="Airtable" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/623000-9d210a39051abc7accec1dc686888dcc-thumb_jpg.jpg?buster=1449952044"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies">Airtable</a>
</div>
<div class="pitch">
Organize anything you can imagine
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies" title="Meerkat"><img alt="Meerkat" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/630861-820b9d4af09e110b150c9affe418d860-thumb_jpg.jpg?buster=1425688408"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies">Meerkat</a>
</div>
<div class="pitch">
Live Stream Video.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies" title="Flight Ventures"><img alt="Flight Ventures" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/658877-89ccd88502db9d964a651ecba6f86d9d-thumb_jpg.jpg?buster=1457552637"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies">Flight Ventures</a>
</div>
<div class="pitch">
Investing in the Top Companies and Entrepreneurs
</div>
</div>
</div>
</div>]
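
To turn those divs into plain data, something like the following could work. It is a sketch that assumes the markup keeps the structure shown in the output above (a.startup-link inside div.name, plus a div.pitch); parse_companies is a hypothetical helper, and the html string is a trimmed copy of the first company from that output.

```python
from bs4 import BeautifulSoup

# trimmed copy of one company div from the output above
html = """
<div class="company column">
<div class="g-lockup">
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>
"""


def parse_companies(soup):
    """Return one dict per company div found in the soup."""
    records = []
    for comp in soup.select("div.company.column"):
        link = comp.select_one("div.name a.startup-link")
        if link is None:
            # skip anything that does not match the expected layout
            continue
        pitch = comp.select_one("div.pitch")
        records.append({
            "id": link["data-id"],
            "name": link.get_text(strip=True),
            "url": link["href"],
            # some companies have an empty pitch div
            "pitch": pitch.get_text(strip=True) if pitch else "",
        })
    return records


records = parse_companies(BeautifulSoup(html, "html.parser"))
print(records)
```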

There are more filters you can set. To see how, select them in the browser and watch the requests being made in Firebug or developer tools, under the XHR tab in Network.
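
Once you have copied the extra form fields from the XHR request, they could be merged into the POST data with a small helper. This is only a sketch: build_payload is hypothetical, and "example_filter" is a placeholder, not a real field name; use whatever names actually appear in the request you observe.

```python
def build_payload(page=1, sort="signal", extra_filters=None):
    """Build the form data for the search_data POST."""
    payload = {"sort": sort, "page": page}
    if extra_filters:
        # fields copied from the browser's XHR request go here
        payload.update(extra_filters)
    return payload


# "example_filter" is a placeholder key, not a real AngelList field
payload = build_payload(page=2, extra_filters={"example_filter": "value"})
print(payload)
```

The resulting dict would be passed as data= in the s.post call.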