John Engelhart - 3 months ago
Scala Question

Apache Spark Handling Skewed Data

I have two tables I would like to join together. One of them has badly skewed data, which prevents my Spark job from running in parallel because the majority of the work lands on a single partition.

I have read about salting keys to improve the distribution and have tried to implement it.
https://www.youtube.com/watch?v=WyfHUNnMutg at 12:45 is exactly what I would like to do.

Any help or tips would be appreciated. Thanks!

Answer

Yes, you should salt the keys on the larger table (via randomization) and then replicate the smaller one, doing a cartesian join of its rows against the set of salt values so every salted key still finds a match.
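To make the idea concrete, here is a minimal sketch of the key transformation in plain Scala (all names here are hypothetical, and the in-memory `Seq`s stand in for RDDs; in Spark you would apply the same `map`/`flatMap` to the two RDDs and then call `join` on the results):

```scala
import scala.util.Random

object SaltedJoinSketch {
  // Large, skewed side: append a random salt in [0, numSalts) to each key,
  // so one hot key is spread across numSalts partitions.
  def saltKeys[K, V](large: Seq[(K, V)], numSalts: Int, rng: Random): Seq[((K, Int), V)] =
    large.map { case (k, v) => ((k, rng.nextInt(numSalts)), v) }

  // Small side: replicate every record once per possible salt value,
  // so it matches whichever salt the large side drew.
  def replicate[K, V](small: Seq[(K, V)], numSalts: Int): Seq[((K, Int), V)] =
    small.flatMap { case (k, v) => (0 until numSalts).map(s => ((k, s), v)) }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val numSalts = 4
    // "a" is the skewed key on the large side
    val large = Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 4)
    val small = Seq("a" -> "x", "b" -> "y")

    val salted = saltKeys(large, numSalts, rng)
    val repl   = replicate(small, numSalts)

    // Emulate the equi-join on the salted key; in Spark: salted.join(repl)
    val joined = for {
      ((k1, s1), v) <- salted
      ((k2, s2), w) <- repl
      if k1 == k2 && s1 == s2
    } yield (k1, (v, w))

    // Salting must not change the join result: same 4 pairs as an unsalted join
    assert(joined.size == 4)
    println(joined)
  }
}
```

The trade-off is that the small side grows by a factor of `numSalts`, so this only pays off when one side is small enough to replicate cheaply.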

Here are a couple of suggestions:

Tresata skew join RDD https://github.com/tresata/spark-skewjoin

python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

The tresata library looks like this:

import com.tresata.spark.skewjoin.Dsl._  // for the implicits

// the skewJoin() method is pulled in by the implicits above
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2),
  DefaultSkewReplication(1)).sortByKey(true).collect.toList