adaaaam - 6 months ago
Python Question

Can you tell me why this web scraper isn't able to log in correctly?

I'm trying to make a web scraper to get some information from the site Colloquy.com, on which I have an account. I am having trouble getting my scraper to log in to the site though. I'm using Python 2.7 with BeautifulSoup and Requests.

Here is a screenshot of my code

and here is a screenshot of the relevant HTML for the login

I've tried several variations of this code, including adding the authorization key to the log-in info. However, no matter what I've tried, I always get the "un-logged-in version" of the site when I get the HTML.

I suspect this has something to do with the site's use of JavaScript for the login (it uses a pop-up box instead of a separate login page). However, I don't know enough about JavaScript to handle this properly, and I haven't been able to find a guide that covers this particular issue.

So hopefully someone can tell me what is wrong with my code or process, or where I can find out how to deal with logins that use JavaScript.

Thanks! :)

Answer

Rather than scraping the page that hosts the JavaScript login box, you can post the credentials directly. The site appears to send the login information to https://colloquy.com/app/account/login, so you could try something like the following to log in.

import requests
# Post the credentials straight to the login endpoint; substitute your own email and password.
resp = requests.post("https://colloquy.com/app/account/login", data={"email": "some.email@address.com", "password": "Password"})

You can then pass resp.cookies along with later requests to fetch the pages you want as a logged-in user.

cookies = resp.cookies
r = requests.get("https://colloquy.com/some-page", cookies=cookies)
# r.text now holds the HTML of the page, ready for parsing
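As an alternative to passing cookies by hand, a requests.Session stores whatever cookies the server sets and sends them back on later requests made through the same session, including cookies set along a redirect. The following is a minimal sketch, assuming the login endpoint and the email/password field names given above are correct; the login helper is a hypothetical name used only for illustration.

```python
import requests

# Endpoint taken from the answer above; confirm it in your browser's Network tab.
LOGIN_URL = "https://colloquy.com/app/account/login"


def login(email, password, session=None):
    """Post the credentials and return the session that now holds any cookies.

    `login` is a hypothetical helper, not part of requests. The form field
    names ("email", "password") are assumed from the answer above.
    """
    if session is None:
        session = requests.Session()
    resp = session.post(LOGIN_URL, data={"email": email, "password": password})
    # Raises requests.HTTPError on a 4xx/5xx status. A 2xx status alone does
    # not prove the login worked, so also inspect the page you fetch next.
    resp.raise_for_status()
    return session
```

Calling session = login("some.email@address.com", "Password") and then session.get("https://colloquy.com/some-page") should send the stored cookies automatically, with no cookies= argument needed.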

Edit: A login page usually issues a POST request behind the scenes that sends the information required to log in, typically a username and password. You can find this request with the Developer Tools in Chrome or Firefox, or with Firebug. To see where the information is posted, I open the tools and then complete the login prompt. After you submit the login form, the Network tab (in Chrome; it may vary for Firefox/Firebug) usually shows a request to some page, often named login or something similar. Clicking that request shows its details, including the Request URL and Request Method, along with the Form Data posted to the Request URL. You should then be able to make a similar POST to the Request URL with the same Form Data.
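One way to check a script against what the browser shows is to build the request without sending it and compare its method, URL, and body with the Network tab. This sketch assumes the form data consists of the email and password fields used earlier; preview_form_post is a hypothetical helper name, and any extra fields the browser sends (hidden tokens, for example) would need to be added to the dictionary.

```python
import requests


def preview_form_post(url, form_data):
    """Build a form-encoded POST without sending it (no network traffic).

    Hypothetical helper: returns a requests.PreparedRequest whose method,
    url, headers and body can be compared with the browser's Network tab.
    """
    return requests.Request("POST", url, data=form_data).prepare()


if __name__ == "__main__":
    prepared = preview_form_post(
        "https://colloquy.com/app/account/login",
        {"email": "some.email@address.com", "password": "Password"},
    )
    # Compare these three values with the Request Method, Request URL
    # and Form Data shown in the developer tools.
    print(prepared.method, prepared.url)
    print(prepared.headers["Content-Type"])
    print(prepared.body)
    # Once everything matches, the request could be sent with:
    # resp = requests.Session().send(prepared)
```

Building the request first keeps the comparison step separate from the network call, which makes it easier to spot a missing field or a wrong URL.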

Note: Some sites block certain User-Agent strings to keep automated scripts and bots away, but you can usually get past this restriction by sending a normal browser User-Agent instead.

# url is the address you are requesting; the header value mimics a desktop Chrome browser.
requests.post(url, headers={"user-agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"})
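If the header is needed on every request, it can be set once on a session rather than repeated in each call. This is a minimal sketch; make_browser_session is a hypothetical helper name, and the User-Agent string is simply the one from the example above.

```python
import requests

# User-Agent string copied from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
)


def make_browser_session(user_agent=BROWSER_UA):
    """Return a requests.Session that sends `user_agent` on every request."""
    session = requests.Session()
    # Session-level headers are merged into each request made via the session.
    session.headers.update({"User-Agent": user_agent})
    return session
```

Any session.get or session.post made through the returned session should then carry the browser-style header without a headers= argument.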