
Writing Spark Streaming Output to a Socket

I have a DStream "Crowd" and I want to write each element in "Crowd" to a socket. When I try to read from that socket, it doesn't print anything. I am using the following lines of code:

val server = new ServerSocket(4000, 200)
val conn = server.accept()
val out = new PrintStream(conn.getOutputStream())
crowd.foreachRDD(rdd => rdd.foreach(record => out.println(record)))


But if I use this (which is not what I want):

crowd.foreachRDD(rdd => out.println(rdd))


It does write something to the socket.

I suspect there is a problem with using rdd.foreach(), although it seems like it should work. I am not sure what I am missing.

Answer

The code outside the DStream closures is executed on the driver, while rdd.foreach(...) is executed on each distributed partition of the RDD, i.e. on the executors. The closure passed to rdd.foreach captures out, so Spark must serialize it and ship it to the workers; a PrintStream backed by a live socket connection cannot usefully be serialized (in practice this typically fails with a "task not serializable" error), and the connection it wraps exists only on the driver's machine anyway. So there's a socket created on the driver's machine and the job tries to write to it from other machines - that cannot work.
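
If you really need to stick with raw sockets, one common workaround (a sketch only; collector-host and the surrounding setup are assumptions, not something from your code) is to invert the direction of the connection: keep a server listening at a known address, and let each partition open its own client connection from the executor, so no socket or stream ever has to be serialized:

import java.io.PrintStream
import java.net.Socket

crowd.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // This closure runs on an executor, so the connection is opened on
    // the machine that actually holds this partition's data.
    val socket = new Socket("collector-host", 4000) // hypothetical host running your ServerSocket
    val out = new PrintStream(socket.getOutputStream)
    records.foreach(record => out.println(record))
    out.flush()
    socket.close()
  }
}

Opening one connection per partition, rather than per record, keeps the connection overhead proportional to the number of partitions.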

DStream.foreachRDD itself, on the other hand, is executed on the driver, so in that second case the socket and the computation are on the same host and the write succeeds. Note, though, that out.println(rdd) prints the RDD object's toString representation rather than its elements, which is why the output is not what you want.
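
If each batch is small enough for the driver to hold, a minimal variant of your original code (again a sketch, with that caveat) is to collect each RDD and write it on the driver, where out and its socket actually live. This gives up Spark's parallelism, so it only suits low-volume streams:

crowd.foreachRDD { rdd =>
  // collect() brings the batch's elements back to the driver process,
  // which owns the ServerSocket connection behind `out`.
  rdd.collect().foreach(record => out.println(record))
  out.flush()
}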

Given the distributed nature of RDD computation, this server-socket approach is hard to make work: dynamic service discovery becomes a challenge, i.e. "where is my server socket open?". Look into a system that gives you centralized access to distributed data. Kafka is a good alternative for this kind of streaming process.
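
As a sketch of that direction - assuming the plain kafka-clients producer API, a hypothetical broker at broker1:9092, and a hypothetical topic crowd-events - each partition can create its own producer on the executor and publish its records:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

crowd.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Created on the executor, one producer per partition.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // hypothetical broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach { record =>
      producer.send(new ProducerRecord[String, String]("crowd-events", record.toString))
    }
    producer.close()
  }
}

Downstream consumers can then read the topic from anywhere, which sidesteps the "where is my server socket open?" problem entirely.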