
Cannot save collect-ed RDD to local file system of Driver

I am trying to save an RDD after calling collect() on it. I invoke spark-submit on Host-1 (I am assuming the Driver is the host from which I invoke the spark-submit script, so in this case Host-1 is the Driver), read some data from HBase, run a few operations on it, call collect() on the resulting RDD, and then iterate over the collected list and write it to a file on the local file system. In essence:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="HBaseInputFormat")
    # read the data from hbase
    # ...
    # ...
    output = new_rdd.collect()

    # collect() returns a plain Python list in the driver process, so this
    # writes to the local file system of whichever host runs the driver.
    # The with statement closes the file, so no explicit close() is needed.
    with open("/var/tmp/tmpfile.csv", "w") as tmpf:
        for o in output:
            print(o)
            tmpf.write("%s\n" % str(o))


This actually works fine, with the data being saved in /var/tmp/tmpfile.csv, except that the file ends up on a different host than the Driver, let's say Host-3.
I was under the impression that collect() always gathers the distributed data set on the Driver host, and hence that the file should be created on the Driver as well.
Where am I wrong?

Answer

I am assuming the Driver is the host from which I invoke the spark-submit script so in this case Host-1 is the Driver

That is not correct! See the documentation on running Spark on YARN:

In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

You are likely running Spark in yarn-cluster mode, so the driver is placed on one of the nodes within the cluster rather than on the host from which you ran spark-submit.
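You can confirm this yourself with a quick diagnostic (a minimal sketch; the appName is arbitrary) that prints the hostname from the driver process:

    import socket

    from pyspark import SparkContext

    sc = SparkContext(appName="DriverHostCheck")
    # This print runs in the driver process, so the hostname reported here
    # is the node the driver actually landed on. In yarn-cluster mode it can
    # be any node in the cluster, not the host that ran spark-submit.
    print("Driver is running on: %s" % socket.gethostname())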

Change this to yarn-client mode and the driver will run on the node from which you submit the job, so /var/tmp/tmpfile.csv will be written on that host.
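For example (a sketch; the script name is a placeholder, and the exact flag spelling depends on your Spark version):

    # Spark 1.x style: deploy mode is part of the master string
    spark-submit --master yarn-client your_script.py

    # Later syntax: master and deploy mode are separate flags
    spark-submit --master yarn --deploy-mode client your_script.py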
