Spark caches the working dataset in memory and then performs computations at memory speed. Is there a way to control how long the working set resides in RAM?
I have a huge amount of data that is accessed throughout the job. Loading it into RAM takes time initially, and when the next job arrives it has to load all the data into RAM again, which is time-consuming. Is there a way to cache the data forever (or for a specified time) in RAM using Spark?
To cache an RDD you call RDD.cache() (or RDD.persist() with a storage level); to uncache explicitly, you can use RDD.unpersist().
If you want to share cached RDDs across multiple jobs, keep in mind that cached RDDs only live as long as the SparkContext that created them, so one common approach is to keep a single long-running context alive (for example via a job server such as spark-jobserver) and submit successive jobs to it.
I have been experimenting with caching options in Spark. You can read more here: http://sujee.net/understanding-spark-caching/