Luke - 2 months ago
Scala Question

Spark: shuffle operation leading to long GC pause

I'm running Spark 2 and am trying to shuffle around 5 terabytes of JSON. I'm running into very long garbage-collection pauses while shuffling a Dataset:

val operations = ….as[MyClass]
operations.repartition(partitions, operations("id")).write.parquet("s3a://foo")

Are there any obvious configuration tweaks to deal with this issue? My configuration is as follows:

spark.driver.maxResultSize 6G
spark.driver.memory 10G
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 32G
spark.hadoop.fs.s3a.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.multipart.uploads.enabled true
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.local.dir /raid0/spark
spark.memory.fraction 0.8
spark.mesos.coarse true
spark.mesos.constraints priority:1
spark.mesos.executor.memoryOverhead 16000
spark.rpc.message.maxSize 1000
spark.speculation false
spark.sql.parquet.mergeSchema false
spark.sql.planner.externalSort true
spark.submit.deployMode client
spark.task.cpus 1


Adding the following flags got rid of the GC pauses.

spark.executor.extraJavaOptions -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
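For context, executor JVM flags like these can also be supplied per job at submit time rather than in spark-defaults.conf. A hedged sketch (the class name and jar are placeholders, not from the original post; --deploy-mode client matches the configuration above):

```
spark-submit \
  --deploy-mode client \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12" \
  --class com.example.ShuffleJob \
  shuffle-job.jar
```

Lowering InitiatingHeapOccupancyPercent from the G1 default of 45 makes concurrent marking start earlier, which trades some throughput for fewer long pauses under heavy shuffle allocation.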

I think it does take a fair amount of tweaking, though. This Databricks post was very helpful.
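When tuning like this, it can help to confirm what the collector is actually doing before changing flags. On Java 8, the standard HotSpot GC-logging options can be added alongside the G1 settings (the log path here is just an example, chosen to match the local dir above):

```
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/raid0/spark/gc.log
```

Long pauses attributed to full GCs or to-space exhaustion in the log point toward the kind of occupancy/concurrency tuning shown above, rather than simply adding more executor memory.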