
How to join two (or more) streams (JavaDStream) in apache spark

We have a spark streaming application that consumes Gnip compliance stream.

In the old version of the API, the compliance stream was provided by one end point but now it is provided by 8 different endpoints.

We could run the same spark application 8 times with different parameters to consume different endpoints.

Is there a way in spark streaming to consume the 8 endpoints and merge them into one in the same application?

Should we use a different streaming context for each connection, or is one context enough?

Answer

I think you are looking for Spark union here.

See the following question for examples: Concatenating datasets of different RDDs in Apache spark using scala

As per the Spark documentation for union:

Return a new dataset that contains the union of the elements in the source dataset and the argument.
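A minimal sketch of what that could look like for your JavaDStream case. The socketTextStream sources and port numbers below are placeholders for however you actually connect to each Gnip compliance endpoint; the point is that one JavaStreamingContext creates all 8 input DStreams, and union() chains them into a single stream:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MergeStreamsExample {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("GnipComplianceUnion");

        // A single streaming context is enough; all input DStreams are created from it.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Placeholder sources: socketTextStream stands in for whatever connector
        // you already use to consume each of the 8 compliance endpoints.
        List<Integer> ports = Arrays.asList(9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008);
        List<JavaDStream<String>> streams = new ArrayList<>();
        for (int port : ports) {
            streams.add(jssc.socketTextStream("localhost", port));
        }

        // Merge everything into one DStream by chaining union().
        JavaDStream<String> merged = streams.get(0);
        for (int i = 1; i < streams.size(); i++) {
            merged = merged.union(streams.get(i));
        }

        // Downstream processing now runs once, over the merged stream.
        merged.print();

        jssc.start();
        jssc.awaitTermination();
    }
}

Regarding your second question: use one streaming context. Only one StreamingContext can be active in a JVM at a time, so a single context that owns all 8 input DStreams (as in the sketch) is the way to go.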
