
Spark: Caching an RDD/DF for use across multiple programs

I have a dataset that is read by multiple programs. Instead of each program reading this dataset into memory several times a day, is there a way for Spark to cache the dataset so that any program can call upon it?


RDDs and Datasets cannot be shared between applications (at least, there is no official API for sharing their memory across applications).

However, you may be interested in a data grid. Take a look at Apache Ignite. You could, for example, load the data into Spark, preprocess it, and save it to the grid. Then, in other applications, you could simply read the data from the Ignite cache.
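
The writing side could look roughly like this (a minimal sketch; the cache name "igniteCache", the String/Int key-value types, and the input path are assumptions for illustration, not part of your setup):

import org.apache.spark.sql.SparkSession
import org.apache.ignite.spark.IgniteContext
import org.apache.ignite.configuration.IgniteConfiguration

val spark = SparkSession.builder().appName("writer").getOrCreate()

// IgniteContext bridges Spark and the Ignite cluster; the closure supplies
// the Ignite node configuration used on each executor.
val igniteContext = new IgniteContext(spark.sparkContext, () => new IgniteConfiguration())

// "igniteCache" is a hypothetical cache name; keys/values here are String/Int.
val igniteRdd = igniteContext.fromCache[String, Int]("igniteCache")

// Read and preprocess the dataset once, then persist it to the grid as
// key/value pairs so other applications can read it without reloading.
val processed = spark.sparkContext
  .textFile("hdfs:///data/input")       // assumed input location
  .map(line => (line, line.length))     // placeholder preprocessing
igniteRdd.savePairs(processed)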

There is a special type of RDD, named IgniteRDD, which allows you to use an Ignite cache just like any other data source. And, like any other RDD, it can be converted to a Dataset.

It would be something like this:

import spark.implicits._  // required for the toDF conversion (spark is the SparkSession)

val rdd = igniteContext.fromCache[String, Int]("igniteCache")
val dataFrame = rdd.toDF("key", "value")
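
From there, the DataFrame behaves like any other; for example, you could register it as a temporary view and query it with Spark SQL (the view name below is just for illustration):

dataFrame.createOrReplaceTempView("sharedData")
spark.sql("SELECT key, value FROM sharedData WHERE value > 10").show()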

You can find more information about IgniteContext and IgniteRDD here.