ptikobj ptikobj - 1 year ago 178
Scala Question

Spark Standalone Mode: How to compress spark output written to HDFS

Related to my other question, but distinct:


If I save an RDD to HDFS, how can I tell spark to compress the output with gzip?
In Hadoop, it is possible to set

mapred.output.compress = true

and choose the compression algorithm with

mapred.output.compression.codec = <<classname of compression codec>>

How would I do this in spark? Will this work as well?

edit: using spark-0.7.2

Answer Source

The method saveAsTextFile takes an additional optional parameter of the codec class to use. So for your example it should be something like this to use gzip:

someMap.saveAsTextFile("hdfs://HOST:PORT/out", classOf[GzipCodec])


Since you're using 0.7.2 you might be able to port the compression code via configuration options that you set at startup. I'm not sure if this will work exactly, but you need to go from this:

conf.set("mapred.output.compress", "true")
conf.set("mapred.output.compression.codec", c.getCanonicalName)
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)

to something like this:

System.setProperty("spark.hadoop.mapred.output.compress", "true")
System.setProperty("spark.hadoop.mapred.output.compression.codec", "true")
System.setProperty("spark.hadoop.mapred.output.compression.codec", "")
System.setProperty("spark.hadoop.mapred.output.compression.type", "BLOCK")

If you get it to work, posting your config would probably be helpful to others as well.