dot.Py, 4 years ago
jQuery Question

How can I scrape data that is locked by a button?

I'm trying to fetch some info from a website, without success.

The problem is that the data is shown only after clicking a certain button.

The info that I want is located in this tag:

<div id="frmContact" class="contactForm hidden"></div>
<div class="btn btn-secondary viewnumber phone-trigger" data-ga-action="header">
<a href="#" rel="nofollow">Ver telefone</a>
<i class="icon"></i>
</div>


It may have something to do with this line:

<form action="/noindex/doctor-phone" id="frmPhone" method="post">
<input name="__RequestVerificationToken" type="hidden" value="3uFb11EKzbTh4TWoqXk025U7jS7QoV5-od7lSgSBzdu616u82jQAHiOTl2aB3q47aRCIg2CjVCjE6R6bUAqDplAOfeM1" />
<input id="entityKey" name="entityKey" type="hidden" value="12898671" />
<input id="placeType" name="placeType" type="hidden" value="" />
<input id="placeKey" name="placeKey" type="hidden" value="" />
</form>
<div id="phonePlacer"></div>


But I don't know how to use this __RequestVerificationToken properly.




Do I have to send a request to the server using this info to get the phone info? If so, how?

After I click the button, this is the popup that appears (I'm interested in info1 to info4):

[Screenshot of the popup, not reproduced here]

My code:

import urllib2
from bs4 import BeautifulSoup

page = BeautifulSoup(urllib2.urlopen('http://www.doctoralia.com.br/medico/RANDOM_PROFILE'), "html.parser")
hidden_tags = page.find_all("input", type="hidden")

for tag in hidden_tags:
    print tag


Output:

<input name="__RequestVerificationToken" type="hidden" value="gPYstKvmi4xBQsV81ECf5mYe695igvq8E2QqtOgBPqtRybEP74OEbSAe8uDg8dlZCpqib94FIrUoPMnpLTC0tY7kiJE1"/>
<input id="entityKey" name="entityKey" type="hidden" value="14336768"/>
<input id="placeType" name="placeType" type="hidden" value=""/>
<input id="placeKey" name="placeKey" type="hidden" value=""/>

Answer Source

This is fairly straightforward with a requests.Session object: you just need to extract the __RequestVerificationToken from the initial page, plus a couple of pieces of form data. I used the full listings page to get the numbers and the link to each doctor's page; the same logic applies wherever you decide to get the number from:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin

head = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"}

base = "http://www.doctoralia.com.br/"

with requests.Session() as s:
    r = s.get('http://www.doctoralia.com.br/medicos/especialidade/dermatologistas-1314')
    page = BeautifulSoup(r.content, "html.parser")
    # Index the matched tag, not the selector string, to read the token value.
    token = page.select_one("input[name=__RequestVerificationToken]")["value"]
    hidden_tags = page.select("article.media.doctor")
    for tag in hidden_tags:
        h3 = tag.select_one("h3")
        key = h3.a["data-track-click"]
        place = tag.select_one("span[data-location]")["data-location"].split("|", 1)[0]

        data = {"__RequestVerificationToken": token,
                "entityKey": key,
                "placeKey": place}
        resp = s.post("http://www.doctoralia.com.br/noindex/doctor-phone", data=data, headers=head)
        soup = BeautifulSoup(resp.content,"html.parser")
        print(urljoin(base,h3.a["href"]))
        print(soup.select_one("li.phone").text.strip())
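
As a rough offline illustration of the token-extraction step, the sketch below collects every hidden input from the form markup shown in the question and turns it into a dict suitable for a POST body. It is written for Python 3 and uses only the standard library instead of BeautifulSoup; the names HiddenInputCollector and hidden_fields are made up for this example, and the token value is a placeholder, not a real token.

```python
from html.parser import HTMLParser

# Form markup from the question; the token value is a placeholder (assumption).
FORM_HTML = """
<form action="/noindex/doctor-phone" id="frmPhone" method="post">
<input name="__RequestVerificationToken" type="hidden" value="PLACEHOLDER_TOKEN" />
<input id="entityKey" name="entityKey" type="hidden" value="12898671" />
<input id="placeType" name="placeType" type="hidden" value="" />
<input id="placeKey" name="placeKey" type="hidden" value="" />
</form>
"""


class HiddenInputCollector(HTMLParser):
    """Hypothetical helper: gathers name/value pairs of hidden <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            # A missing value attribute is treated as an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the given HTML."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


payload = hidden_fields(FORM_HTML)
for name, value in payload.items():
    print(name, "=", repr(value))
```

A dict built this way could be passed as data= to s.post in the code above, assuming the token was read from a page fetched in the same session.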

That gets you all the links and the phone numbers for each doctor; anything you see in the popup when you click the button will be available to parse. The essential form data is the __RequestVerificationToken and the entityKey; the placeKey does not seem to affect the POST, but there is no harm in including it. The headers are also not essential in this instance, but it is always a good idea to send a User-Agent. You might want to add a sleep between requests so you don't hammer the server if you are making a lot of requests. Also, looking at the robots.txt:

User-agent: *
Disallow: /noindex/
Disallow: /usuarios/
Disallow: /users/
Disallow: /utilisateurs/
Disallow: /utenti/
Disallow: /gebruikers/
Disallow: /nutzer/
Disallow: /medical-center/m/
Disallow: /consultant/m/
Disallow: /centro-medico/m/
Disallow: /medico/m/
Disallow: /centre-medical/m/
Disallow: /medicin/m/
Disallow: /centro-medico/m/
Disallow: /medico/m/
Disallow: /centri-medici/m/
Disallow: /medecin/m/
Disallow: /healthpro/m/
Disallow: /facharzt/m/
Disallow: /sanitätszentrum/m/
Disallow: /clickfav/
Disallow: /clicktlf/
Disallow: /reservas/
Disallow: /citas/
Disallow: /medisch-centrum/m/
Disallow: /deskundige/m/
Disallow: /arzt/m/
Disallow: /klinik/m/
Disallow: /citas/
Disallow: /turnos/
Disallow: /appuntamenti/
Disallow: /appointments/
Disallow: /consultas/
Disallow: /ws/Schedules.asmx/
Disallow: /RESOURCE NOT FOUND/
Disallow: /RESOURCE+NOT+FOUND/
Disallow: /RESOURCE%20NOT%20FOUND/
Disallow: /entities/

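There is no user-agent restriction, and the listing and profile pages are not disallowed. Note, however, that the first rule, Disallow: /noindex/, does cover the /noindex/doctor-phone endpoint that the POST above targets, so the phone request itself falls under a disallowed path.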
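
One way to check specific URLs against rules like these is the standard library's urllib.robotparser (Python 3). The sketch below parses a handful of the rules quoted above from memory rather than fetching the live file, so it only reflects the rules as listed here; the profile path is the placeholder from the question.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules quoted above; the full file would be parsed the same way.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /noindex/",
    "Disallow: /usuarios/",
    "Disallow: /medico/m/",
    "Disallow: /entities/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse in-memory lines; no network request is made

BASE = "http://www.doctoralia.com.br"
PATHS = [
    "/medicos/especialidade/dermatologistas-1314",  # listing page used in the answer
    "/medico/RANDOM_PROFILE",  # profile placeholder from the question
    "/noindex/doctor-phone",  # endpoint the phone form posts to
]

# Map each path to whether a generic crawler ("*") may fetch it.
results = {path: parser.can_fetch("*", BASE + path) for path in PATHS}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "disallowed")
```

With these rules, can_fetch returns False for any path that starts with /noindex/, since the matching is a simple prefix test on the URL path.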
