divadpoc divadpoc - 9 months ago 97
Java Question

Authentication with crawler4j

My goal is to log in to a site and then fetch my account information. I'm using crawler4j 4.2:

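// config is my CrawlConfig (storage folder etc. set elsewhere)
AuthInfo authJavaForum = new FormAuthInfo("myuser", "mypwd", "http://www.java-forum.org", "login", "password");
config.addAuthInfo(authJavaForum); // register the form login with the crawl config
PageFetcher pf = new PageFetcher(config);
RobotstxtServer robotstxt = new RobotstxtServer(new RobotstxtConfig(), pf);
CrawlController ctrl = new CrawlController(config, pf, robotstxt);
// add the page I want as seed
ctrl.startNonBlocking(BasicCrawler.class, 5);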

In the log I can see that the authentication was successful, and I can also see that the HTTP client connection now holds a cookie containing the session I got from the page. But I still seem to be missing something: the request for my personal details fails with status code 403 (Forbidden), as if I weren't logged in.

I used Wireshark to compare the traffic from crawler4j with the traffic from logging in by hand, and the requests look identical. The biggest difference is that my cookie doesn't contain any ga (Google Analytics) information.

1) How can I stay logged in?

2) Could some other issue be preventing me from staying logged in?

3) Is there any site where login actually works with crawler4j?

What I've tried so far (after cloning the repository):

a) Setting a CookieStore in the constructor of PageFetcher (although the HTTP client library already creates one by default).

b) In fetchPage (within PageFetcher), creating an HttpClientContext, setting the CookieStore on it, and passing it to the execute method.

No success, though.

I also tried webmagic and extended it with my own downloader/httpClientGenerator to support form authentication, but I ran into the same problem.
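For readers unfamiliar with what the cookie store is supposed to do here, the following is a minimal sketch using the JDK's own java.net.CookieManager as a stand-in. It is not crawler4j's or Apache HttpClient's CookieStore, and the cookie name xf_session, its value, and the /account/ path are made up for illustration. It only shows the expected behaviour: a session cookie received from the login response is stored and then offered again for later requests to the same host.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class SessionCookieSketch {

    // Store the Set-Cookie header of a (hypothetical) login response.
    static CookieManager managerAfterLogin(String loginUrl, String setCookie) throws Exception {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        manager.put(URI.create(loginUrl), Map.of("Set-Cookie", List.of(setCookie)));
        return manager;
    }

    // Ask which Cookie header values would be sent with a later request.
    static List<String> cookiesFor(CookieManager manager, String url) throws Exception {
        Map<String, List<String>> headers = manager.get(URI.create(url), Map.of());
        return headers.getOrDefault("Cookie", List.of());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical session cookie; the real name and value come from the site.
        CookieManager manager = managerAfterLogin(
                "http://www.java-forum.org/login/login", "xf_session=abc123; Path=/");
        System.out.println(cookiesFor(manager, "http://www.java-forum.org/account/"));
    }
}
```

If the cookie is replayed like this and the server still answers 403, the cookie handling itself is probably not the culprit, and the login request is worth a second look.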

Related question: Crawler4j with authentication


This really is embarrassing. After checking the page again, especially the login form, I realized that the form's action points to login/login. After changing the URL in my AuthInfo to http://www.java-forum.org/login/login, I get my personal details.
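To avoid guessing the login URL, one option is to read the form's action attribute and resolve it against the page URL. The sketch below does this with the JDK only. The class name LoginActionResolver, the helper resolveLoginUrl, and the HTML snippet are hypothetical, and the regex is a shortcut for illustration rather than a robust HTML parser.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LoginActionResolver {

    // Captures the action attribute of the first <form> tag in the markup.
    private static final Pattern FORM_ACTION = Pattern.compile(
            "<form\\b[^>]*\\baction\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Resolves the form action (possibly relative) against the page URL.
    // Note: pageUrl should have a path (at least a trailing slash), since
    // URI.resolve misbehaves when the base URI has an empty path.
    static String resolveLoginUrl(String pageUrl, String html) {
        Matcher m = FORM_ACTION.matcher(html);
        if (!m.find()) {
            throw new IllegalArgumentException("no form action found");
        }
        return URI.create(pageUrl).resolve(m.group(1)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical markup mirroring the relative action described above.
        String html = "<form action=\"login/login\" method=\"post\">";
        System.out.println(resolveLoginUrl("http://www.java-forum.org/", html));
        // prints http://www.java-forum.org/login/login
    }
}
```

The resolved URL is what would then go into FormAuthInfo as the login URL, in place of the site's front page.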