Re: Flink disaster recovery test problems

Zhu Zhu
Hi Zhong,

It looks like you are assigning the tasks to different slot sharing groups to force them not to share the same slot.
You will therefore need at least 2 slots for the streaming job to start running successfully.
Killing one of the 2 TMs (one slot each) leaves too few slots, and your job will hang at slot allocation.
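
To make the slot arithmetic concrete, here is a minimal sketch in plain Java. It is a toy model, not Flink API: the class `SlotDemand`, the method `requiredSlots`, and the group names are made up for illustration. It assumes that operators in the same slot sharing group can share slots, so each group needs as many slots as its highest parallelism, and that groups do not share slots with each other.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical names, not Flink API): estimates how many slots a job needs.
class SlotDemand {

    // groupOf[i] is the slot sharing group of operator i, parallelism[i] its parallelism.
    // Assumption: a group needs max(parallelism) slots, and groups never share slots.
    static int requiredSlots(String[] groupOf, int[] parallelism) {
        Map<String, Integer> maxPerGroup = new HashMap<>();
        for (int i = 0; i < groupOf.length; i++) {
            // Keep the highest parallelism seen so far for this group.
            maxPerGroup.merge(groupOf[i], parallelism[i], Integer::max);
        }
        int total = 0;
        for (int slots : maxPerGroup.values()) {
            total += slots;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] parallelism = {1, 1};
        int separate = requiredSlots(new String[] {"groupA", "groupB"}, parallelism);
        int shared = requiredSlots(new String[] {"default", "default"}, parallelism);
        int slotsLeftAfterKill = 1; // one TM with one slot survives

        System.out.println("separate groups need " + separate + " slots");
        System.out.println("one shared group needs " + shared + " slot");
        System.out.println("separate groups fit on the surviving TM: "
                + (separate <= slotsLeftAfterKill));
    }
}
```

Under these assumptions, the same two operators would fit into the single surviving slot if they were left in one slot sharing group.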

Task state is needed so that unprocessed source data is not skipped, which avoids data loss. It is also needed if you want the failed task to recover to its state right before the failure.
Checkpointing is needed to persist the task state. If it is not enabled, the job restarts with its initial state, i.e. it consumes data from the very beginning, and a large amount of data may have to be reprocessed.
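
As a rough illustration of the difference, here is a toy model in plain Java. It is not Flink API: `CheckpointReplay`, `resumeOffset`, and `recordsToReprocess` are hypothetical names, and the offsets are invented numbers. It assumes a source that reads records by offset, and that a restart without any checkpoint begins again at offset 0.

```java
// Toy model (hypothetical names, not Flink API): where a restarted source resumes reading.
class CheckpointReplay {

    // A negative lastCheckpointedOffset means no checkpoint was ever taken.
    // Assumption: without a checkpoint the source starts again from offset 0.
    static long resumeOffset(long lastCheckpointedOffset) {
        return lastCheckpointedOffset < 0 ? 0L : lastCheckpointedOffset;
    }

    // Number of records read again after a failure at failureOffset.
    static long recordsToReprocess(long failureOffset, long lastCheckpointedOffset) {
        return failureOffset - resumeOffset(lastCheckpointedOffset);
    }

    public static void main(String[] args) {
        long failureOffset = 1000L; // invented example value

        System.out.println("with checkpoint at offset 900: reprocess "
                + recordsToReprocess(failureOffset, 900L) + " records");
        System.out.println("without any checkpoint: reprocess "
                + recordsToReprocess(failureOffset, -1L) + " records");
    }
}
```

In this model nothing is skipped in either case, but the amount of data read again depends on how recent the last checkpoint is.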

Thanks,
Zhu Zhu

钟旭阳 <[hidden email]> wrote on Tue, Nov 5, 2019 at 3:01 PM:
Hello,


I am currently learning Flink. I recently ran into a problem while testing Flink for disaster recovery. I tried to find an answer on the official website and in blog posts but failed, so I am asking the community for help.


The current situation is: I have two servers, each with one slot. My application has two operators, each with a parallelism of 1, and I use the slotSharingGroup function to make them run in these two slots respectively.


My disaster recovery test is to shut down one of the servers. Is it possible for the two operators to compete for the slot on the remaining server? In addition, I want to dynamically add or remove servers (simulated power failures, etc.) while Flink is running, but I think this must cause stream data loss. Is restarting Flink through the checkpoint mechanism the only way to ensure that no data is lost while the number of servers changes dynamically?


Best
Zhong