Ivan Assalim - 6 months ago
Java Question

JSOUP - How to crawl a "login required" page using JSOUP

I'm having trouble crawling a particular website. The problem is that, after successfully logging in to the site, I can't access a link that requires a valid login.

For example:

    public Document executeLogin(String user, String password) {
        try {
            Connection.Response loginForm = Jsoup.connect(url)
                    .method(Connection.Method.GET)
                    .execute();

            Document mainPage = Jsoup.connect(loginValidationUrl)
                    .data("user", user)
                    .data("senha", password)
                    .cookies(loginForm.cookies())
                    .post();

            Document evaluationPage = Jsoup.connect(loginRequiredUrl)
                    .get();

            return evaluationPage;
        } catch (IOException ioe) {
            return null;
        }
    }


What I do here is:


  • Get the cookies from the login page, so I can log in properly;

  • Then post to the login validation URL, which returns the main page after login;

  • Finally, try to access the login-required URL after logging in, but that request returns the login page, as if the session had expired.



I know I have to store the cookies to keep the session alive, but when I connect to the login validation URL it returns a Document object, and there are no cookies to get from that object.

Is there any way to get the "session" created by the successful login and send it with the other Jsoup.connect calls? What I want to do is crawl a page that can only be accessed by logged-in users.

Thank you very much in advance.

Answer

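Get the cookies after you log in. Call execute() instead of post() so you get back a Connection.Response, which exposes the cookies, rather than a Document: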

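    // Fetch the login page first to pick up any pre-login cookies.
    Connection.Response loginForm = Jsoup.connect(url)
            .method(Connection.Method.GET)
            .execute();

    // Use execute() rather than post() so the response, and its cookies, stay accessible.
    // The method must be set to POST explicitly, since execute() defaults to GET.
    Connection.Response mainPage = Jsoup.connect(loginValidationUrl)
            .data("user", user)
            .data("senha", password)
            .cookies(loginForm.cookies())
            .method(Connection.Method.POST)
            .execute();

    // Cookies set by the login response, including the session cookie.
    Map<String, String> cookies = mainPage.cookies();

    // Send those cookies with the request for the protected page.
    Document evaluationPage = Jsoup.connect(loginRequiredUrl)
            .cookies(cookies)
            .execute()
            .parse();

    return evaluationPage;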

When you request the second page, you also have to send the cookies, which is what the .cookies(cookies) call in the last request above does.

(Source: I had this problem a few days ago)
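As a rough illustration of why the cookies have to travel with every request, here is a toy model in plain Java. It is not Jsoup and not a real server: the `SessionDemo` class, the `SESSIONID` cookie name, and the credential check are all made up for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for a site with cookie-based sessions (hypothetical, not a real server). */
class SessionDemo {
    // Session ids the fake server currently treats as logged in.
    private final Set<String> activeSessions = new HashSet<>();
    private int nextId = 1;

    /** Simulates the login POST: returns the cookies the server would set. */
    Map<String, String> login(String user, String password) {
        Map<String, String> cookies = new HashMap<>();
        if (!user.isEmpty() && !password.isEmpty()) { // hypothetical credential check
            String id = "session-" + nextId++;
            activeSessions.add(id);
            cookies.put("SESSIONID", id);
        }
        return cookies;
    }

    /** Simulates a GET of a protected page with the given request cookies. */
    String fetchProtected(Map<String, String> requestCookies) {
        String id = requestCookies.get("SESSIONID");
        return activeSessions.contains(id) ? "evaluation page" : "login page";
    }

    public static void main(String[] args) {
        SessionDemo site = new SessionDemo();
        Map<String, String> cookies = site.login("user", "secret");
        System.out.println(site.fetchProtected(new HashMap<>())); // prints "login page"
        System.out.println(site.fetchProtected(cookies));         // prints "evaluation page"
    }
}
```

The first fetch mirrors the request in the question: the login succeeded, but the follow-up request carries no cookies, so the server has no way to tie it to the session.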

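So it's easiest to keep the cookies in a Map:

    Map<String, String> cookies = loginForm.cookies();

Then send that map with every later request, including the form submission. If the login response sets new cookies, add those to the map as well.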
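Some sites keep the session cookie from the login form, while others issue a new one in the login response, so it can help to combine both maps before passing them to .cookies(...). A minimal sketch in plain Java, assuming the two maps come from loginForm.cookies() and mainPage.cookies() as in the snippets above; the `CookieMerge` class and the cookie names and values are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CookieMerge {
    /** Returns a new map holding the earlier cookies, overridden by any later ones. */
    static Map<String, String> merge(Map<String, String> earlier, Map<String, String> later) {
        Map<String, String> merged = new LinkedHashMap<>(earlier);
        merged.putAll(later); // a later value for the same cookie name wins
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for loginForm.cookies() and mainPage.cookies().
        Map<String, String> fromLoginForm = Map.of("JSESSIONID", "abc", "lang", "pt");
        Map<String, String> fromLoginPost = Map.of("JSESSIONID", "xyz");

        Map<String, String> merged = merge(fromLoginForm, fromLoginPost);
        System.out.println(merged.get("JSESSIONID")); // prints "xyz"
        System.out.println(merged.get("lang"));       // prints "pt"
    }
}
```

The merged map can then be handed to each subsequent request, so that cookies from both responses are sent.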
