
How to set Hadoop configuration values from PySpark

The Scala version of SparkContext has the property

sc.hadoopConfiguration


I have successfully used it to set Hadoop properties in Scala, e.g.

sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")


However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values on the Hadoop Configuration used by the PySpark context?

Answer

I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:

# Read files under dev/* with the new Hadoop API; per-job Hadoop
# properties are passed via the optional conf dict of string pairs.
fileLines = sc.newAPIHadoopFile(
    'dev/*',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()
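
For settings that should apply beyond a single call, one workaround (a sketch relying on PySpark's private sc._jsc attribute, which wraps the JVM-side JavaSparkContext and is not a stable public API) is to reach through to the JVM Hadoop Configuration directly:

# Caveat: sc._jsc is a private attribute and may change between Spark versions.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("my.mapreduce.setting", "someVal")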