We have a flink job containing almost 20 process functions (map, flatMap, process, filter, etc.) The state dependencies among those process functions are very complex:
- Shared states are several key-value maps.
- Different functions share different states.
- Functions may query and modify states.
- We want those states to be single source of truth.
We have tried two ways to solve the problem, both were not good enough:
- We tried to redesign stream graph -- using CoFlatMapFunction to store the states and union different types of messages. That made our stream much more complex
and hard to ensure the correctness.
- We tried to use outside storage, like redis, zookeeper (poor performance) and mysql. In this case, our code was much much more simpler and it did run correctly.
However, when we gave functions a high parallelism, we had bad performance issue -- Since each parallel sub task not shared connections, it wasted many network and hardware resources.
Are there other better ways to deal with shared state problem in flink? If not, which way I mentioned is better? Could any flink master give us some best practices.
Thanks.
Edit: We migrated from akka actor system and akka stream. In akka, sharing a singleton state is not that hard -- just add a state actor.
StackOverflow link:
https://stackoverflow.com/questions/47388411/single-source-of-truth-for-states-among-multiple-process-functions