sarthak sarthak - 1 month ago 23
Scala Question

Partitioning in Spark

I created an RDD by parallelizing the following array:

var arr: Array[(Int,Char)] = Array()
for (i <- 'a' to 'z') {arr = arr :+ (1,i)} // Key 1 has 25 elements
for (i <- List.range('a','c')) {arr = arr :+ (2,i)} // Key 2 has 2
for (i <- List.range('a','f')) {arr = arr :+ (3,i)} // Key 3 has 5
val rdd = sc.parallelize(arr,8)


I wanted to partition the above RDD so that every partition receives different key and the partitions are of almost the same size. The code below allows me to partition the RDD by keys:

val prdd = rdd.partitionBy(new HashPartitioner(3))


The partitions created by the above code have the following sizes:

scala> prdd.mapPartitions(iter=> Iterator(iter.length)).collect
res43: Array[Int] = Array(25, 2, 5)


Is there a way I can make the partitions of almost equal size from this rdd? So for example for the case above the key 1 has the largest partition size of 25. Can I have partition size like:

Array[Int] = Array(5, 5, 5, 5, 5, 2, 5)


I tried doing the
RangePartition
on the above
prdd
but it didn't work.

Answer

The issue you're running into is inherent in your data.

  1. Your keys have a very imbalanced distribution
  2. You want all keys to be grouped together.

There's really no way to have even distribution given these two! If you print the partition sizes when you first call parallelize, you'll see that the partitions are relatively balanced -- sc.parallelize will chunk the data evenly.

Spark partitioners provide a deterministic function from Key K to partition index p. There's no way to have several partitions for the "1" key while preserving this function. Range partitions are useful for maintaining order on an RDD, but won't help here -- for any given key, there can only be one partition you need to look in.

Are you partitioning so that you can do key/value RDD operations like join or reduceByKey later? If so, then you're out of luck. If not, then we can play some tricks with partitioning by key/value combinations rather than just key!