Shankar - 9 months ago
Scala Question

Spark Streaming - How do i notify the Consumer once the Producer is done?

Is it possible to notify the Consumer once the Producer has published all of the data to the Kafka topic?

It is possible that the same data (identified by a unique field) is available in multiple partitions, so I need to group the data and do some calculations.
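The grouping step itself is just a key-based aggregation over that unique field, regardless of which partition each copy arrived on. A minimal sketch in plain Scala (the `Record` shape and field names are hypothetical; in Spark this would typically be a `reduceByKey` over the stream):

```scala
// A record as it might arrive from any Kafka partition.
// "uniqueId" stands in for whatever unique field the data carries.
case class Record(uniqueId: String, value: Double)

// Group records by their unique field and aggregate, regardless of
// which partition each copy came from.
def aggregateByKey(records: Seq[Record]): Map[String, Double] =
  records
    .groupBy(_.uniqueId)
    .map { case (id, recs) => id -> recs.map(_.value).sum }
```

For example, `aggregateByKey(Seq(Record("a", 1.0), Record("a", 2.0), Record("b", 3.0)))` groups the two `"a"` records together even if they came from different partitions.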

I thought of using a sliding window for this, but the problem remains that we don't know whether the Producer has finished publishing the data.

The volume is around 50K messages. Can Kafka handle 50K messages on a single partition within seconds if we have brokers with a better configuration?

Currently, we are planning to have multiple partitions and split the data using the default partitioner.

Any efficient way to handle this?


Every fifteen minutes, the producer gets the data and starts publishing it to the Kafka topic. I am sure this is a use case for batch processing, but this is our current design.

Answer Source

Spark Streaming doesn't work like that. It operates on an infinite stream of data flowing in and being processed at each batch interval. This means that if you want to signal a logical "end of batch", you'll need to send a message indicating that this batch of data is over, allowing you to send the processed messages to an output sink of your choice.
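One way to sketch that end-of-batch signal, in plain Scala rather than actual Spark/Kafka API calls: the producer appends a sentinel message after the real data, and the consumer buffers records until it sees it. The sentinel value and message shape here are assumptions for illustration, not part of any Kafka API:

```scala
// Hypothetical sentinel the producer publishes after the real data.
val EndOfBatchMarker = "__END_OF_BATCH__"

// Buffer incoming messages; return the data seen so far plus a flag
// indicating whether the marker arrived (i.e. the batch is complete
// and can be flushed to the output sink).
def consume(messages: Seq[String]): (Seq[String], Boolean) = {
  val (data, rest) = messages.span(_ != EndOfBatchMarker)
  val batchComplete = rest.nonEmpty // marker was seen
  (data, batchComplete)
}
```

Note that with multiple partitions a marker per partition would be needed, since Kafka only guarantees message ordering within a single partition.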

One way you can achieve this is by using stateful streams which aggregate data across batches and allow you to keep state between batch intervals.
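The idea behind Spark's stateful operators (`updateStateByKey` / `mapWithState`) is to carry a per-key accumulator from one batch interval to the next. The following plain-Scala fold imitates that mechanism; the state shape (a running sum per unique key) is an assumption for illustration:

```scala
// State carried between batch intervals: running sum per unique key.
type State = Map[String, Double]

// One batch interval: merge the new batch's (key, value) pairs into
// the state carried over from previous intervals.
def updateState(state: State, batch: Seq[(String, Double)]): State =
  batch.foldLeft(state) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0.0) + v)
  }

// Run several batch intervals, threading the state through, the way
// a stateful stream accumulates data across batches.
def runIntervals(batches: Seq[Seq[(String, Double)]]): State =
  batches.foldLeft(Map.empty[String, Double])(updateState)
```

Combined with an end-of-batch marker, the accumulated state can be emitted to the output sink once the marker is observed, then reset for the next fifteen-minute cycle.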