I am trying to save an RDD after calling collect() on it. I invoke spark-submit on Host-1 (I am assuming the driver is the host from which I invoke the spark-submit script, so in this case Host-1 is the driver), read some data from HBase, run some operations on it, then call collect() on the RDD, iterate over the collected list, and save it to a file on the local file system. In essence:
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="HBaseInputFormat")
    # read the data from hbase
    # ... (transformations producing new_rdd)
    output = new_rdd.collect()
    with open("/var/tmp/tmpfile.csv", 'w') as tmpf:
        for o in output:
            tmpf.write(str(o) + "\n")
I am assuming the Driver is the host from which I invoke the spark-submit script so in this case Host-1 is the Driver
That is not correct! See the documentation on running Spark on YARN.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
You are likely running Spark in yarn-cluster mode, in which case the driver is placed on one of the nodes within the cluster.
Change the deploy mode to yarn-client and the driver will run on the node from which you submitted the job, so the file will be written to Host-1's local file system.
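A quick way to confirm where the driver actually ends up is to log the driver machine's hostname from inside the script itself (a minimal sketch; the variable name is illustrative):

```python
import socket

# This runs in the driver process, so it prints the hostname of the
# machine hosting the driver. In yarn-client mode this should be the
# host you ran spark-submit from; in yarn-cluster mode it will be
# whichever cluster node YARN chose for the application master.
driver_host = socket.gethostname()
print("Driver is running on: {}".format(driver_host))
```

You can then submit with e.g. `spark-submit --master yarn --deploy-mode client your_script.py` and check that the printed hostname matches Host-1 before relying on local-file writes.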