Shankar Shankar - 4 years ago
Scala Question

Spark Streaming - Batch Interval vs Processing time

We have a Spark Streaming application running on a YARN cluster. It receives messages from Kafka topics.

Our processing time is longer than the batch interval:

Batch Interval : 1 Minute
Processing Time : 5 Minutes

I would like to know what happens if data is received while a batch is still being processed. Will the data be kept until the current processing finishes, or will it be overwritten by the subsequent fetch?

We are using the Direct Streaming approach to fetch data from Kafka topics.
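For context, a direct stream in this kind of setup is typically created as in the sketch below (the broker address, topic name, and app name are hypothetical; the 1-minute batch interval matches the question):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch")
    // Batch interval of 1 minute, as described in the question
    val ssc = new StreamingContext(conf, Minutes(1))

    // Hypothetical broker list and topic name
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")

    // Direct approach: no receiver; the driver computes offset ranges
    // per batch and executors read those ranges straight from Kafka
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd => println(s"Records in batch: ${rdd.count()}") }

    ssc.start()
    ssc.awaitTermination()
  }
}
```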

Should I go with window-based operations? For example, with a window length of 5 minutes, a sliding interval of 2 minutes, and a batch interval of 1 minute, would it work? We cannot lose any data in our application.

Answer Source

In the direct streaming approach, data isn't read by a receiver and then dispatched to other workers. Instead, the driver reads the offsets from Kafka and then sends each partition, along with the subset of offsets to be read, to the executors.
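You can observe this driver-side offset bookkeeping yourself: each RDD produced by a direct stream carries the exact offset ranges it was assigned. A sketch (assuming `stream` is a direct stream as created earlier):

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// One OffsetRange per topic-partition, assigned by the driver per batch
stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} " +
      s"from=${r.fromOffset} until=${r.untilOffset}")
  }
}
```

Because the data stays in Kafka until those ranges are actually read, a delayed batch simply means the offsets are read later, not lost.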

If your workers haven't finished processing the previous job, they won't start processing the next one (unless you explicitly set spark.streaming.concurrentJobs to more than 1). This means the offsets will be read, but the work won't actually be dispatched to the executors responsible for reading the data, so there won't be any data loss whatsoever.
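If you did want batches to overlap (usually inadvisable here, since batches may then complete out of order), the setting mentioned above is applied on the SparkConf, e.g.:

```scala
// Sketch: allow up to 2 streaming jobs to run concurrently.
// Default is 1, which is what serializes batch processing.
val conf = new SparkConf()
  .setAppName("DirectStreamSketch") // hypothetical app name
  .set("spark.streaming.concurrentJobs", "2")
```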

What this does mean is that your job will fall further and further behind, causing massive processing delays, which isn't something you want. As a rule of thumb, any Spark job's processing time should be less than the interval set for that job.
