ethrbunny ethrbunny - 3 months ago 33
Java Question

kafkastreams - adding more processing capacity

Im working on a POC converting an existing Flink application / topology to use KafkaStreams. My question is about deployment.

Specifically - in Flink one adds "Worker Nodes" to the flink installation and then adds more parallelization to the topology to keep up with increasing data rates.

How does one increase KStreams capacity as the data rate increases? Does KStreams handle this automatically? Do I launch more processes (ala Micro-services)?

Or am I missing the big picture here?

Answer

Do I launch more processes (ala Micro-services)?

The short answer is yes:

  • Answer 1 (adding capacity): To scale out, you simply start another instance of your stream processing application, e.g. on another machine. The instances of your application will become aware of each other and automatically begin to share the processing work.
  • Answer 2 (removing capacity): Simply stop one or more running instances of your stream processing application, e.g. shut down 2 of 4 running instances. The remaining instances of your application will become aware that other instances were stopped and automatically take over the processing work of the stopped instances.

See the Kafka Streams documentation at http://docs.confluent.io/3.0.0/streams/developer-guide.html#elastic-scaling-of-your-application for further details (unfortunately the Apache Kafka docs on Kafka Streams don't have these details yet).

Or am I missing the big picture here?

The big picture is that the picture is actually nice and small. :-)

So let me add the following, because I feel that many users are confused by the complexity of other, related technologies and then don't really expect that you can do stream processing (including its deployment) in a much simpler way, like what you can do with Kafka Streams:

A Kafka Streams application is a normal, plain old Java application that happens to use the Kafka Streams library. A key differentiator to existing stream processing technologies is that, by using the Kafka Streams library, your application becomes scalable, elastic, fault-tolerant, and so on, without requiring a special "processing cluster" to add machines to, like you'd do for Flink, Spark, Storm, etc. The Kafka Streams deployment model is much simpler and easier: just start or stop additional instances of your application (i.e. literally the same code). This works with basically any deployment related tool, including but not limited to Puppet, Ansible, Docker, Mesos, YARN. You can even do that manually by running java ... YourApp.

Comments