Re: Flink-1.6.1 :: HighAvailability :: ZooKeeperRunningJobsRegistry

Posted by Andrey Zagrebin
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Flink-1-6-1-HighAvailability-ZooKeeperRunningJobsRegistry-tp24115p24132.html

Hi Mike,

What was the full job life cycle? 
Did you start it with Flink 1.6.1, or did you cancel a job that was running on 1.6.0? 
Was there a Job Master failover while the job was running, before the cancellation?
Which ZooKeeper version do you use?

Flink creates child nodes in ZooKeeper to lock the job node.
The lock is released by removing the ephemeral child node.
A persistent node can be a problem: if the job dies without removing it,
the persistent node will not time out and disappear the way an ephemeral one does,
and the next job instance will not delete it because the node still appears to be locked by the previous one.
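
For illustration, the scheme works roughly like the following Curator sketch (Curator is what Flink uses to talk to ZooKeeper; the connection string, paths and lock name below are made up, this is not the actual Flink code):

import java.util.UUID;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class JobLockSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Hypothetical paths, only for illustration.
        String jobNode = "/flink/jobs/some-job-id";
        String lockNode = jobNode + "/lock-" + UUID.randomUUID();

        // The lock is an ephemeral child node: it disappears automatically
        // when the session that created it dies.
        client.create()
              .creatingParentsIfNeeded()
              .withMode(CreateMode.EPHEMERAL)
              .forPath(lockNode);

        // Releasing the lock = deleting the ephemeral child node.
        client.delete().forPath(lockNode);

        // If the lock child were PERSISTENT and the job died before deleting it,
        // the node would stay forever and block cleanup of the job node.
        client.close();
    }
}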

There was a recent fix in 1.6.1 for a case where job data was not properly deleted from ZooKeeper [1].
In general this should not happen: all job-related data should be cleaned up in ZooKeeper upon cancellation.
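
If you want to verify what is left behind after cancellation, you can list the children under the running-jobs registry path, roughly like this (a sketch; the default HA root is /flink, but the cluster-id and sub-paths depend on your high-availability.zookeeper.path.* configuration, so adjust the path accordingly):

import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ListLeftoverJobNodes {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Assumed default layout: /<root>/<cluster-id>/running_job_registry
        String runningJobsPath = "/flink/default/running_job_registry";

        if (client.checkExists().forPath(runningJobsPath) != null) {
            List<String> children = client.getChildren().forPath(runningJobsPath);
            // After a clean cancellation this list should be empty.
            System.out.println("Leftover entries: " + children);
        }

        client.close();
    }
}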

Best,
Andrey

[1] https://issues.apache.org/jira/browse/FLINK-10011

On 25 Oct 2018, at 15:30, Mikhail Pryakhin <[hidden email]> wrote:

Hi Flink experts!

When a streaming job with ZooKeeper HA enabled gets cancelled, the job-related ZooKeeper nodes are not removed. Is there a reason behind that? 
I noticed that the ZooKeeper paths are created as "container" nodes (nodes that can have children and are removed by the server automatically once their last child is deleted) and fall back to the persistent node type if the ZooKeeper server doesn't support container nodes. 
But in any case, the job's ZooKeeper node should be removed when the job is cancelled, shouldn't it?
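
Roughly, I understand the creation with fallback to work like the following sketch (made-up path and connection string, not the actual Flink code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

public class ContainerNodeFallbackSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        String path = "/flink/jobs/some-job-id"; // made-up path

        try {
            // CONTAINER nodes (ZooKeeper 3.5+) are deleted by the server
            // automatically once their last child is removed.
            client.create()
                  .creatingParentsIfNeeded()
                  .withMode(CreateMode.CONTAINER)
                  .forPath(path);
        } catch (KeeperException.UnimplementedException e) {
            // Older servers don't support container nodes, so fall back to a
            // plain persistent node, which nothing cleans up automatically.
            client.create()
                  .creatingParentsIfNeeded()
                  .withMode(CreateMode.PERSISTENT)
                  .forPath(path);
        }

        client.close();
    }
}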

Thank you in advance!

Kind Regards,
Mike Pryakhin