It's a continuation of this question: Porting a multi-threaded compute intensive job to spark
Old post here, but I ended up with this solution. I am using Spark's Java API. Step 1: load each data source into a JavaPairRDD&lt;Long, YourType&gt;, where Long is the ID and YourType is whatever the data in Postgres or Mongo looks like for that ID (a list, an object, anything).
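A minimal sketch of step 1. The `YourType` record and the fact that the rows are handed in as a list are assumptions for illustration; in the real job the rows would come from whatever Postgres/Mongo bulk read you already use:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.List;

public class LoadSource {
    // Hypothetical record type; stands in for whatever Postgres/Mongo
    // returns per ID (a row, a document, a list of rows, ...).
    public static class YourType implements java.io.Serializable {
        public final String payload;
        public YourType(String payload) { this.payload = payload; }
    }

    // Step 1: one JavaPairRDD<Long, YourType> per data source,
    // keyed by the simulation ID.
    public static JavaPairRDD<Long, YourType> loadSource(JavaSparkContext sc,
                                                         List<Tuple2<Long, YourType>> rows) {
        // In the real job, 'rows' comes from a bulk read done up front
        // (on the driver or via a parallel read), NOT from inside tasks.
        return sc.parallelizePairs(rows);
    }
}
```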
Say we end up with 3 data sources for an ID. Step 2: join the three JavaPairRDDs so that everything for an ID is localized in a single tuple, a JavaRDD&lt;Tuple4&lt;Long, YourType1, YourType2, YourType3&gt;&gt;. This tuple is all the simulation logic needs, so it gets passed straight in.
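Step 2 can be sketched as two pairwise joins followed by a map that flattens the nested tuples into the Tuple4 (the generic type parameters stand in for YourType1/2/3):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple4;

public class JoinSources {
    // Step 2: join the three per-source RDDs on the ID so every record
    // carries all the data the simulation needs for that ID.
    public static <A, B, C> JavaRDD<Tuple4<Long, A, B, C>> localize(
            JavaPairRDD<Long, A> src1,
            JavaPairRDD<Long, B> src2,
            JavaPairRDD<Long, C> src3) {
        return src1.join(src2)   // (id, (a, b))
                   .join(src3)   // (id, ((a, b), c))
                   .map(t -> new Tuple4<>(t._1(),
                                          t._2()._1()._1(),
                                          t._2()._1()._2(),
                                          t._2()._2()));
    }
}
```

Note that join is an inner join, so IDs missing from any one source drop out; use leftOuterJoin instead if partial records should survive.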
Step 3: once we have the joined RDD, repartition(n) it, where n is the total number of partitions the cluster can handle.
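Steps 3 and 4 together look roughly like the sketch below. The `simulate` method is a hypothetical placeholder for the original multi-threaded per-ID computation (here it just echoes the ID); the important part is that the map body does no I/O:

```java
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple4;

public class RunSimulation {
    // Placeholder for the real per-ID simulation logic.
    public static double simulate(long id, Object a, Object b, Object c) {
        return (double) id;
    }

    // Steps 3-4: spread the localized tuples across n partitions and run
    // the simulation inside map(), with no DB calls in the task code.
    public static JavaRDD<Double> run(JavaRDD<Tuple4<Long, Object, Object, Object>> localized,
                                      int n) {
        return localized
                .repartition(n)   // n = partitions the cluster can handle concurrently
                .map(t -> simulate(t._1(), t._2(), t._3(), t._4()));
    }
}
```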
I didn't have to do any in-memory lookup, because the data is properly localized inside the RDD (step 2). By removing all I/O (DB calls) from the distributed tasks (i.e. the code running inside map-like operations in Spark), I achieved the expected execution time. I am now able to process about 200 IDs in an hour; most of that time is spent in the actual simulation logic, while data loading etc. takes &lt;10 mins.