Kot Kot - 2 months ago 38
Java Question

Apache Spark in-memory caching

Spark caches the working dataset into memory and then performs computations at memory speeds. Is there a way to control how long the working set resides in RAM?

I have a huge amount of data that the job accesses. Loading it into RAM initially takes time, and when the next job arrives, all the data has to be loaded into RAM again, which is time-consuming. Is there a way to cache the data in RAM forever (or for a specified time) using Spark?


To uncache an RDD explicitly, you can use RDD.unpersist().

If you want to share cached RDDs across multiple jobs, you can try the following:

  1. Cache the RDD in a single context and re-use that context for the other jobs. This way you cache only once and use the data many times.
  2. There are 'Spark job servers' that provide this functionality. Check out Spark Job Server, open sourced by Ooyala.
  3. Use an external caching solution like Tachyon.
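As a rough illustration of option 1, here is a minimal sketch using the Spark Java API. The class name, app name, local master, and the tiny in-memory dataset are all assumptions made for the example; a real job would load its own data and keep the context alive in a long-running driver. The cached RDD stays available until it is unpersisted or the context is stopped, although Spark may still evict cached partitions if executors run short of memory.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import java.util.Arrays;

// Hypothetical example class: one long-lived context, one cached RDD, several jobs.
public class SharedCacheSketch {

    // Runs two jobs against the same cached RDD and returns their results.
    public static long[] run() {
        // Assumption: local mode and a made-up app name, for illustration only.
        SparkConf conf = new SparkConf()
                .setAppName("shared-cache-sketch")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // Stand-in for the expensive load; mark it to be kept in memory.
            JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                    .persist(StorageLevel.MEMORY_ONLY());

            // Job 1: the first action computes the RDD and populates the cache.
            long total = data.count();

            // Job 2: runs in the same context, so it can read the cached partitions.
            long evens = data.filter(x -> x % 2 == 0).count();

            // Release the cached partitions explicitly once no more jobs need them.
            data.unpersist();

            return new long[] {total, evens};
        } finally {
            // Stopping the context also discards anything it had cached.
            sc.stop();
        }
    }

    public static void main(String[] args) {
        long[] results = run();
        System.out.println("total=" + results[0] + ", evens=" + results[1]);
    }
}
```

The point of the sketch is that both actions run against the same context: a second application with its own context would not see this cache, which is why a job server or an external store is suggested for sharing across applications.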

I have been experimenting with caching options in Spark. You can read more here: http://sujee.net/understanding-spark-caching/