
Distributed Web crawling using Apache Spark

An interesting question was asked of me when I attended an interview on web mining: is it possible to crawl websites using Apache Spark?

I guessed that it was possible, because of Spark's distributed processing capability. After the interview I searched for this, but couldn't find any interesting answer. Is this possible with Spark?


How about this approach:

Your application would get a set of website URLs as input for your crawler. If you were implementing just a normal (non-Spark) app, you might do it as follows (see the sketch after this list):

  1. Split all the web pages to be crawled into a list of separate sites, each small enough to be handled well by a single thread: for example, if you have to crawl everything from 20150301 to 20150401, the split can be one list of URLs per day.
  2. Assign each base URL to a single thread; it is in these threads where the actual data fetching happens.
  3. Save the result of each thread to the file system.
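To make the non-Spark version concrete, here is a minimal sketch of the three steps above. It assumes a hypothetical fetch helper, an invented example.com URL scheme for the per-day pages, and Futures on the global thread pool standing in for explicit per-URL threads:

import java.nio.file.{Files, Paths}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ThreadedCrawl {
  // hypothetical helper: fetch the raw content of one page
  def fetch(url: String): String = scala.io.Source.fromURL(url).mkString

  def main(args: Array[String]): Unit = {
    // 1. split the work: one base URL per day in March 2015 (assumed URL scheme)
    val days = (1 to 31).map(d => f"201503$d%02d")
    val baseURLs = days.map(day => s"http://example.com/archive/$day")

    // 2. crawl each base URL concurrently (one Future per URL, standing in for one thread per URL)
    val results: Seq[Future[(String, String)]] =
      baseURLs.map(url => Future(url -> fetch(url)))

    // 3. save each result to the local file system
    val all = Await.result(Future.sequence(results), Duration.Inf)
    all.foreach { case (url, content) =>
      val name = url.replaceAll("[^A-Za-z0-9]", "_") + ".html"
      Files.write(Paths.get(name), content.getBytes("UTF-8"))
    }
  }
}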

When the application becomes a Spark one, the same procedure happens, but encapsulated in Spark notions: we can write a custom CrawlRDD that does the same stuff:

  1. Split sites: def getPartitions: Array[Partition] is a good place to do the split task.
  2. Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] will be spread across all the executors of your application and run in parallel.
  3. Save the RDD into HDFS.

The final program looks like:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// X is a placeholder for the type of the crawled content (for example, a (url, html) pair)

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL; if there is one, fill in nextURL and return true,
        // otherwise return false
      }

      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as an X
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://...")   // save the crawled results into HDFS
    sc.stop()
  }
}
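Once the program is packaged into a jar (the name crawler.jar below is just an assumption), it would typically be submitted to a cluster with spark-submit, along these lines:

spark-submit --class Crawl --master yarn crawler.jar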