MaatDeamon - 21 days ago
Scala Question

Spark for asynchronous updates.

I read the following in the book Spark in Action:

"Spark isn’t suitable, though, for asynchronous updates to shared data (such as online transaction processing, for example), because it has been created with batch analytics in mind. (Spark streaming is simply batch analytics applied to data in a time window.) Tools specialized for those use cases will still be necessary."

Can someone explain what is meant by this?

I am interested in using Spark to perform an ETL process. As a side note, I intend to use Kafka in the middle. However, I do not understand the issue, because taking data from Kafka and writing it to a database seems to be much the same thing: it would be done in parallel.

Answer

Spark Streaming works in small batches, i.e. every X units of time, Spark reads all the data that has become available from the streaming source since the last read. It then processes all of that data together.

Because of this batching, updates to downstream systems have an inherent latency (the X interval), unlike other tools (e.g. Flink, Apex) that work record by record. Note, however, that when it comes to updating OLTP destinations, if you can live with the latency, you might actually get better throughput, as batch updates are usually more efficient.
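
To make the trade-off concrete, here is a minimal sketch in plain Scala that simulates the micro-batch behaviour described above. It does not use Spark's API at all: `Record`, `microBatches` and `waitTimes` are hypothetical names made up for this illustration, and the arrival times are invented sample data. It only shows how grouping records by interval adds waiting time while reducing the number of writes.

```scala
object MicroBatchSketch {
  // Hypothetical record type: a value plus the time (ms) it arrived at the source.
  final case class Record(arrivalMs: Long, value: String)

  // Group records into consecutive intervals of `intervalMs`.
  // The key of each group is the time at which its interval closes,
  // i.e. the earliest moment the whole group can be processed together.
  def microBatches(records: Seq[Record], intervalMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => (r.arrivalMs / intervalMs + 1) * intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (closeMs, rs) => (closeMs, rs.sortBy(_.arrivalMs)) }

  // How long each record waits before its batch even starts processing.
  def waitTimes(records: Seq[Record], intervalMs: Long): Seq[Long] =
    microBatches(records, intervalMs).flatMap { case (closeMs, rs) =>
      rs.map(r => closeMs - r.arrivalMs)
    }

  def main(args: Array[String]): Unit = {
    // Invented sample data, for illustration only.
    val records = Seq(
      Record(100, "a"), Record(400, "b"), Record(900, "c"),
      Record(1200, "d"), Record(2500, "e")
    )
    val intervalMs = 1000L // the "X" from the answer above

    val batches = microBatches(records, intervalMs)
    batches.foreach { case (closeMs, rs) =>
      println(s"t=$closeMs ms: one batched write of ${rs.map(_.value).mkString(", ")}")
    }
    println(s"writes with micro-batching: ${batches.size}, record by record: ${records.size}")
    println(s"longest wait before processing: ${waitTimes(records, intervalMs).max} ms")
  }
}
```

With a 1000 ms interval, the five sample records collapse into three batched writes instead of five individual ones, but the record that arrived at 100 ms waits 900 ms before anything happens to it. A record-by-record engine would have no such wait, at the cost of one write per record.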