Thanks for the update, Dyana. I'm also not an expert in running one's own ZooKeeper cluster. It might be related to setting up the ZooKeeper cluster properly. Maybe someone else from the community has experience with this, so I'm cross-posting this thread to the user ML again for a wider reach.
Cheers,
Till
Like all the best problems, I can't get this to reproduce locally.
Everything worked as expected. I started a test job with 5 retained checkpoints, let it run, and watched the nodes in ZooKeeper.
Then I shut down and restarted the Flink cluster.
The ephemeral lock nodes in the retained checkpoints transitioned from one lock id to another without a hitch.
So that's good.
As I understand it, if the ZooKeeper cluster is having a sync issue, ephemeral nodes may not get deleted when their session ends. We're new to running our own ZooKeeper, so it may be down to that.
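
For anyone who wants to check for leftover lock nodes, here is a rough Python sketch of the comparison. It assumes you have already collected each node's ephemeralOwner (the session id ZooKeeper reports in the node's stat, 0 for a persistent node) and the set of session ids the ensemble still considers live, e.g. from the `dump` four-letter command on the leader. The function name, paths, and session ids are made up for illustration and are not taken from Flink or from our cluster.

```python
def find_orphaned_ephemerals(ephemeral_owners, live_sessions):
    """Return paths of ephemeral nodes whose owning session is no longer live.

    ephemeral_owners: dict mapping znode path -> ephemeralOwner session id
                      (0 means the node is persistent, not ephemeral).
    live_sessions:    set of session ids the ensemble currently reports.
    """
    return sorted(
        path
        for path, owner in ephemeral_owners.items()
        if owner != 0 and owner not in live_sessions
    )


# Hypothetical data: two lock nodes under one retained checkpoint.
owners = {
    "/example/checkpoint-5": 0,                 # persistent parent node
    "/example/checkpoint-5/lock-a": 0x100000001,  # owner session is gone
    "/example/checkpoint-5/lock-b": 0x100000002,  # owner session is live
}
live = {0x100000002}

orphans = find_orphaned_ephemerals(owners, live)
print(orphans)
```

Any path this returns is an ephemeral node that outlived its session, which is the symptom I'd expect if the ensemble is not cleaning up properly.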