Dom Shahbazi - 2 months ago
Java Question

Implementing Threads in a Java Web Crawler

Here is the original web crawler which I wrote (just for reference):

https://github.com/domshahbazi/java-webcrawler/tree/master


This is a simple web crawler which visits a given initial web page, scrapes all the links from the page, and adds them to a queue (LinkedList). The links are then popped off one by one and visited, and the cycle starts again. To speed up my program, and for learning, I tried to implement it using threads so I could have many threads operating at once, indexing more pages in less time. Below is each class:

Main Class

public class controller {

    public static void main(String args[]) throws InterruptedException {

        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");

        Thread crawl1 = new Crawler(data);
        Thread crawl2 = new Crawler(data);

        crawl1.start();
        crawl2.start();
    }
}


Crawler Class (Thread)

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler extends Thread {

    /** Instance of the shared data structure **/
    DataStruc data;

    /** Number of page connections allowed before the program terminates **/
    private final int INDEX_LIMIT = 10;

    /** Constructor: takes the shared data structure **/
    public Crawler(DataStruc d) {
        data = d;
    }

    public void run() {

        // Counter to keep track of the number of indexed URLs
        int counter = 0;

        // While URLs are left to visit
        while ((data.url_to_visit_size() > 0) && counter < INDEX_LIMIT) {

            // Poll the next URL to visit from the queue
            String currentUrl = data.getURL();

            try {
                // Fetch and parse the HTML document
                Document doc = Jsoup.connect(currentUrl)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
                        .referrer("http://www.google.com")
                        .timeout(12000)
                        .followRedirects(true)
                        .get();

                // Increment the counter if the connection to the web page succeeds
                counter++;

                // .select returns a list of elements (links in this case)
                Elements links = doc.select("a[href]");

                // Add newly found links to the queue
                addLinksToQueue(links);

            } catch (IOException e) {
                // e.printStackTrace();
                System.out.println("Error: " + currentUrl);
            }
        }
    }

    public void addLinksToQueue(Elements el) {
        // For each element in links
        for (Element e : el) {

            // The 'abs' prefix ensures an absolute URL is returned rather than a
            // relative one ('www.reddit.com/hello' rather than '/hello')
            String theLink = e.attr("abs:href");

            if (theLink.startsWith("http") && !data.oldLink(theLink)) {
                data.addURL(theLink);
                // Register each unique URL to ensure it isn't stored in 'url_to_visit' again
                data.addVisitedURL(theLink);
                System.out.println(theLink);
            }
        }
    }
}


DataStruc Class

import java.util.ArrayList;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DataStruc {

    /** Queue to store URLs; can be accessed by multiple threads **/
    private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();

    /** ArrayList of visited URLs **/
    private ArrayList<String> visited_url = new ArrayList<String>();

    public DataStruc(String initial_url) {
        url_to_visit.offer(initial_url);
    }

    // Add a URL to the queue
    public void addURL(String url) {
        url_to_visit.offer(url);
    }

    // Get the URL at the front of the queue
    public String getURL() {
        return url_to_visit.poll();
    }

    // Number of URLs left to visit
    public int url_to_visit_size() {
        return url_to_visit.size();
    }

    // Add a visited URL
    public void addVisitedURL(String url) {
        visited_url.add(url);
    }

    // Checks if a link has already been visited
    public boolean oldLink(String link) {
        for (String s : visited_url) {
            if (s.equals(link)) {
                return true;
            }
        }
        return false;
    }
}


DataStruc is the shared data structure class, which is concurrently accessed by each instance of a Crawler.java thread. DataStruc has a queue to store the links to be visited, and an ArrayList to store visited URLs, to prevent entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, as I understand it takes care of concurrent access. I didn't think I required concurrent access for my ArrayList of visited URLs, as all I need to be able to do is add to it and iterate over it to check for matches.

My problem is that when I compare the operation time of using a single thread vs. using 2 threads (on the same URL), my single-threaded version seems to work faster. I feel I have implemented the threading incorrectly, and would like some tips if anybody can pinpoint the issues.

Thanks!

Answer

Added: see my comment. I think the check in Crawler

// While URLs are left to visit
while ((data.url_to_visit_size() > 0) && counter < INDEX_LIMIT) {

is wrong. The second thread will stop immediately, since the first thread has polled the only URL.
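One way to avoid that particular race, as a minimal sketch (the BlockingDataStruc name and the 5-second timeout are illustrative, not from the asker's code): back the queue with a LinkedBlockingQueue and poll with a timeout, so a thread that momentarily sees an empty queue waits for new work instead of terminating.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical variant of DataStruc: getURL() waits briefly instead of
// returning null the instant the queue happens to be empty.
public class BlockingDataStruc {

    private final LinkedBlockingQueue<String> urlToVisit = new LinkedBlockingQueue<String>();

    public BlockingDataStruc(String initialUrl) {
        urlToVisit.offer(initialUrl);
    }

    public void addURL(String url) {
        urlToVisit.offer(url);
    }

    // Waits up to 5 seconds for a URL; returns null only if no work arrived
    // in that window, which the caller can treat as "the crawl has drained".
    public String getURL() throws InterruptedException {
        return urlToVisit.poll(5, TimeUnit.SECONDS);
    }
}

The crawler loop would then be written as while (counter < INDEX_LIMIT && (currentUrl = data.getURL()) != null) { ... }, so neither thread exits just because the other polled the seed URL first.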

You can ignore the rest, but I've left it for history ...

My general approach to these kinds of "big blocks that can run in parallel" problems is:

  1. Make each crawler a Callable. Probably Callable<List<String>>
  2. Submit them to an ExecutorService
  3. When they complete, take the results one at a time and add them to a List (see the sketch below).
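A minimal sketch of that shape, assuming a hypothetical fetchLinks(url) helper that wraps the Jsoup fetch-and-parse shown in the question (the class and method names here are illustrative, not from the original code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlController {

    public static void main(String[] args) throws Exception {
        List<String> seeds = Arrays.asList("http://example.com/a", "http://example.com/b");

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();

        // 1 + 2. Each crawl of one URL is a Callable<List<String>>,
        // submitted to the ExecutorService.
        for (final String url : seeds) {
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() {
                    return fetchLinks(url);
                }
            }));
        }

        // 3. Collect the results one at a time into a plain (non-concurrent) list.
        List<String> allLinks = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            allLinks.addAll(f.get()); // blocks until that task completes
        }
        pool.shutdown();
        System.out.println("Found " + allLinks.size() + " links");
    }

    // Placeholder: in the real program this would be the Jsoup fetch-and-parse.
    private static List<String> fetchLinks(String url) {
        return new ArrayList<String>();
    }
}

Each round of the crawl then becomes: submit one task per URL in the current frontier, gather all the returned links, dedupe them, and submit the next round.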

Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback as they are running. And, if what they return is huge, you may need to worry about memory.

Would this suit your needs? You would still have to worry about addVisitedURL, so you would still need that to be a concurrent data structure.
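As a hedged illustration of that last point (the VisitedSet name and markVisited method are mine, not part of the original code): on Java 8+, a concurrent set built from ConcurrentHashMap makes the "have we seen this URL?" check and the registration a single atomic step, which the separate oldLink()/addVisitedURL() pair is not.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical thread-safe replacement for the visited-URL ArrayList.
public class VisitedSet {

    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // add() returns true only for the first thread to register this URL,
    // so two threads can never both decide the same link is "new".
    public boolean markVisited(String url) {
        return visited.add(url);
    }
}

addLinksToQueue could then use if (theLink.startsWith("http") && visited.markVisited(theLink)) { data.addURL(theLink); } and drop the separate oldLink check entirely.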

Added: since you are starting with a single URL, this strategy doesn't apply directly. You could apply it after the visit to the first URL.
