Shankar Shankar - 4 years ago
Scala Question

Spark Streaming - Batch Interval vs Processing time

We have a Spark Streaming application running on a YARN cluster. It receives messages from Kafka topics.

Our processing time is longer than the batch interval:

Batch Interval : 1 Minute
Processing Time : 5 Minutes

I would like to know what happens if data is received while a batch is still being processed. Will the data be kept until the current processing finishes, or will it be overwritten by the subsequent fetch?

We are using the Direct Streaming approach to fetch data from Kafka topics.
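For context, a direct stream in this kind of setup is typically created as in the sketch below (the broker address, topic name, and app name are hypothetical; the 1-minute batch interval matches the question):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch")
    // Batch interval of 1 minute, as described in the question
    val ssc = new StreamingContext(conf, Minutes(1))

    // Hypothetical broker list and topic name
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")

    // Direct approach: no receiver; the driver computes offset ranges
    // per batch and executors read those ranges straight from Kafka
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd => println(s"Records in batch: ${rdd.count()}") }

    ssc.start()
    ssc.awaitTermination()
  }
}
```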

Should I go with window-based operations? For example, with a window length of 5 minutes, a sliding interval of 2 minutes, and a batch interval of 1 minute, would it work? We cannot lose any data in our application.

Answer Source

In the direct streaming approach, data isn't read by a receiver and then dispatched to other workers. Instead, the driver reads the offsets from Kafka and then sends each partition, along with the subset of offsets to be read, to the executors.
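You can observe this driver-side offset bookkeeping yourself: each RDD produced by a direct stream carries the exact offset ranges it was assigned. A sketch (assuming `stream` is a direct stream as created earlier):

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// One OffsetRange per topic-partition, assigned by the driver per batch
stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} " +
      s"from=${r.fromOffset} until=${r.untilOffset}")
  }
}
```

Because the data stays in Kafka until those ranges are actually read, a delayed batch simply means the offsets are read later, not lost.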

If your workers haven't finished processing the previous job, they won't start processing the next one (unless you explicitly set spark.streaming.concurrentJobs to more than 1). This means the offsets will be read, but the work won't actually be dispatched to the executors responsible for reading the data, so there won't be any data loss whatsoever.
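If you did want batches to overlap (usually inadvisable here, since batches may then complete out of order), the setting mentioned above is applied on the SparkConf, e.g.:

```scala
// Sketch: allow up to 2 streaming jobs to run concurrently.
// Default is 1, which is what serializes batch processing.
val conf = new SparkConf()
  .setAppName("DirectStreamSketch") // hypothetical app name
  .set("spark.streaming.concurrentJobs", "2")
```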

What this does mean is that your job will fall further and further behind, causing massive processing delays, which isn't something you want. As a rule of thumb, any Spark job's processing time should be less than the interval set for that job.
