user218750 - 4 months ago
Scala Question

How to combine two JavaPairRDDs using custom logic

I have two JavaPairRDDs:

JavaPairRDD<List<String>, CustomObject> originalData = ...;
JavaPairRDD<String, CustomField> newData = ...;


Here, CustomField is the type of a field in CustomObject. My goal is to combine the two datasets wherever the key from newData is contained in the key list from originalData. For example, given

originalData = ({"foo1", "foo2", "foo3"}, customObject1)

newData = ("foo1", customField1)

I want to match these two items and insert customField1 into customObject1. I looked into cogroup and fullOuterJoin, but those functions match on equal keys, which doesn't work here because the keys have different types (List<String> versus String). What is the best way to combine these two datasets?

Answer

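Do you need to preserve the original shape? If not, you can use cartesian and filter the resulting pairs. Keep in mind that cartesian pairs every record of one RDD with every record of the other, so it gets expensive on large datasets: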

originalData.cartesian(newData).filter(checkCondition);
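
To make the filter step concrete, here is a minimal plain-Java sketch of what checkCondition would need to test. It deliberately avoids a Spark dependency: nested loops stand in for cartesian, an if stands in for filter, and plain Strings stand in for CustomObject and CustomField. The class and method names (CartesianFilterSketch, matches, demo) are made up for illustration. With Spark tuples, the same test would presumably read pair._1()._1().contains(pair._2()._1()).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

// Plain-Java stand-in (no Spark dependency) for cartesian + filter.
// Strings stand in for CustomObject and CustomField; all names are hypothetical.
class CartesianFilterSketch {

    // The containment test: does the newData key occur in the originalData key list?
    static boolean checkCondition(Entry<List<String>, String> original,
                                  Entry<String, String> update) {
        return original.getKey().contains(update.getKey());
    }

    // Compare every original record with every new record (the cartesian part)
    // and keep only the pairs that pass checkCondition (the filter part).
    static List<String> matches(List<Entry<List<String>, String>> originalData,
                                List<Entry<String, String>> newData) {
        List<String> result = new ArrayList<>();
        for (Entry<List<String>, String> original : originalData) {
            for (Entry<String, String> update : newData) {
                if (checkCondition(original, update)) {
                    result.add(original.getValue() + " <- " + update.getValue());
                }
            }
        }
        return result;
    }

    // The sample data from the question, plus one record that should not match.
    static List<String> demo() {
        List<Entry<List<String>, String>> originalData = new ArrayList<>();
        originalData.add(new SimpleEntry<>(Arrays.asList("foo1", "foo2", "foo3"), "customObject1"));
        originalData.add(new SimpleEntry<>(Arrays.asList("bar1"), "customObject2"));

        List<Entry<String, String>> newData = new ArrayList<>();
        newData.add(new SimpleEntry<>("foo1", "customField1"));

        return matches(originalData, newData);
    }
}
```

The nested loops show why this approach scales poorly: the number of comparisons is the product of the two dataset sizes.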

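You can also flatten originalData so that each string in the key list becomes the key of its own record, and then use an ordinary join: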

JavaPairRDD<String, CustomObject> flatData = originalData.flatMapToPair(flatteningFunc);
flatData.join(newData);
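
As a rough illustration of the flatten-then-join idea, the plain-Java sketch below shows what flatteningFunc has to produce and what the join then returns. It does not use Spark: lists of entries stand in for the RDDs, Strings stand in for CustomObject and CustomField, and the names FlattenJoinSketch, flatten, join and demo are hypothetical. It also assumes each key appears at most once in newData, which a real Spark join does not require.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// Plain-Java stand-in (no Spark dependency) for flatten + join.
// Strings stand in for CustomObject and CustomField; all names are hypothetical.
class FlattenJoinSketch {

    // Role of flatteningFunc: emit one (key, object) record per element of the key list.
    static List<Entry<String, String>> flatten(List<Entry<List<String>, String>> originalData) {
        List<Entry<String, String>> flat = new ArrayList<>();
        for (Entry<List<String>, String> record : originalData) {
            for (String key : record.getKey()) {
                flat.add(new SimpleEntry<>(key, record.getValue()));
            }
        }
        return flat;
    }

    // Equality join on the flattened key.
    // Assumption: each key occurs at most once in newData, so a Map lookup is enough.
    static List<String> join(List<Entry<String, String>> flat, Map<String, String> newData) {
        List<String> joined = new ArrayList<>();
        for (Entry<String, String> record : flat) {
            String field = newData.get(record.getKey());
            if (field != null) {
                joined.add(record.getKey() + ": " + record.getValue() + " <- " + field);
            }
        }
        return joined;
    }

    // Sample data modeled on the question, with two matching keys and one unmatched record.
    static List<String> demo() {
        List<Entry<List<String>, String>> originalData = new ArrayList<>();
        originalData.add(new SimpleEntry<>(Arrays.asList("foo1", "foo2", "foo3"), "customObject1"));
        originalData.add(new SimpleEntry<>(Arrays.asList("bar1"), "customObject2"));

        Map<String, String> newData = new HashMap<>();
        newData.put("foo1", "customField1");
        newData.put("foo3", "customField3");

        return join(flatten(originalData), newData);
    }
}
```

Note that an original record whose key list matches several newData keys shows up once per match after the join (customObject1 appears twice in the demo), so if you need one record per original object you would have to merge those results afterwards.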