Vlad Vlad - 3 months ago 11
Java Question

Is map in State object the same on all nodes?

This Spark App is running on 3 nodes. I have a State object(MessageState) that contains a HashMap. This HashMap contains a Graph(key leaf, value parent)(and no, GraphX is not a solution for this) Let's say the State object will become to big to fit on one node so it will be distributed on the other 2 nodes. If I want to know for a leaf it's most upper parent(it will do some recursive function to go through the whole map) is it a possibility that let's say that the leaf is on node 3 and the most upper parent is on node 1 and it will not find it or spark distribution will take care of that so the whole map data will be available for searching. My question in fact is how is the State distribution working.

JavaPairDStream<String, String> inputMessagesStream = readFromKafkaStream1();
Function3<String, Optional<String>, State<MessageState>, String> messageState = (key, value, state) -> {
//MessageState contains the HashMap
if (state.exists()) {
return state.get().process(value.get());
} else {
MessageState ms = new MessageState();
ms.process(value.get());
state.update(ms);
return null;
}
};

JavaMapWithStateDStream<String, String, MessageState, String> message1 = inputMessagesStream.mapWithState(StateSpec.function(messageState));

Answer

"Return a JavaMapWithStateDStream by applying a function to every key-value element of this stream, while maintaining some state data for each unique key."

Since all values for a single key in a PairRDDStream are on a single node, the state for that key lives on the same node as well (if you have too many values, they can end up on several nodes, but Spark will still try to minimize amount of data it has to transfer). You can't access state for different keys from mapWithState, so "is it a possibility that let's say that the leaf is on node 3 and the most upper parent is on node 1 and it will not find it" doesn't apply.