I am using Spark to consume data from Kafka. After restarting Spark once it has consumed some data, how do I ensure that it resumes consuming from the offset where it left off?
For example, if on the first run Spark had consumed up to a certain offset, how do I make the next run pick up from that same offset rather than starting over? Here is my current setup:
SparkConf sparkConf = new SparkConf().setAppName("name");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
Map<String, String> kafkaParams = new HashMap<>();
JavaPairReceiverInputDStream<String, EventLog> messages =
        KafkaUtils.createStream(jssc, String.class, EventLog.class,
                StringDecoder.class, EventLogDecoder.class,
                kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER_2());
In the receiver-based approach (which you're using via KafkaUtils.createStream), offsets are handled for you: Kafka's high-level consumer commits them to ZooKeeper, and enabling checkpointing with the write-ahead log (WAL) is what allows the application to resume from the proper offsets without losing received data on restart. If you want to control exactly where your application resumes from, look into the direct (receiverless) streaming API, KafkaUtils.createDirectStream, and specifically the overload that takes
fromOffsets: Map[TopicAndPartition, Long].
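To use that overload, your job has to persist the last processed offset per partition somewhere (a file, a database, ZooKeeper) and rebuild the fromOffsets map on startup. As a minimal, dependency-free sketch of that pattern — the class name, the "topic:partition" key format, and the properties file are my own illustrative choices, not part of the Spark API — the round trip looks like this; the actual KafkaUtils.createDirectStream call is shown only as a comment since it needs a live Kafka cluster:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch: persist per-partition offsets so a restarted job can rebuild the
// fromOffsets map that the direct (receiverless) Kafka API expects.
public class OffsetStore {
    private final Path file;

    public OffsetStore(Path file) {
        this.file = file;
    }

    // Save offsets keyed as "topic:partition" -> next offset to read.
    public void save(Map<String, Long> offsets) throws IOException {
        Properties props = new Properties();
        for (Map.Entry<String, Long> e : offsets.entrySet()) {
            props.setProperty(e.getKey(), Long.toString(e.getValue()));
        }
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "kafka offsets");
        }
    }

    // Load previously committed offsets; returns an empty map on first run.
    public Map<String, Long> load() throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        if (!Files.exists(file)) {
            return offsets;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        for (String key : props.stringPropertyNames()) {
            offsets.put(key, Long.parseLong(props.getProperty(key)));
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("offsets", ".properties");
        OffsetStore store = new OffsetStore(tmp);

        // Simulate committing offsets at the end of a batch.
        Map<String, Long> committed = new HashMap<>();
        committed.put("events:0", 42L);
        committed.put("events:1", 7L);
        store.save(committed);

        // On restart, reload them to seed the resume point.
        Map<String, Long> resumed = store.load();
        System.out.println(resumed.get("events:0")); // 42
        System.out.println(resumed.get("events:1")); // 7

        // In the real job, each "topic:partition" entry would be turned into
        // a TopicAndPartition key and the resulting
        // Map<TopicAndPartition, Long> passed as fromOffsets to
        // KafkaUtils.createDirectStream, e.g. (not runnable here):
        // KafkaUtils.createDirectStream(jssc, String.class, EventLog.class,
        //         StringDecoder.class, EventLogDecoder.class, EventLog.class,
        //         kafkaParams, fromOffsets, mmd -> mmd.message());
    }
}
```

Inside each batch you would save the largest offset processed per partition (obtainable via HasOffsetRanges on the direct stream's RDDs), so the save happens after the work is done and the job resumes exactly where it stopped.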