(DEPRECATED) Apache Flink User Mailing List archive.

RE: Old Flink jobs restarting on Job Manager failover

Stephen.Hesketh

RE: Old Flink jobs restarting on Job Manager failover

Hi all,

We have a Flink environment that uses ZooKeeper to manage the cluster. High availability is configured with the high-availability.storageDir parameter pointing to a shared directory on NAS, which is available to all nodes.

When ZooKeeper fails over to the standby JobManager during a cluster change, we see old jobs that were cancelled long ago being restarted automatically by Flink. It looks as if the standby JobManager is reconnecting with old configuration and old job details.

I can’t see anything in the log that indicates why these old jobs are restarting. I have noticed that blob.storage.directory is set to a local directory.

Are there any other settings in Flink that might cause a JobManager to restart from old local state rather than the latest shared state?

Thanks,

Steve

Stephen Hesketh
Reporting Shared Services, NatWest Markets

250 Bishopsgate, London EC2M 4AA

Office: +44 (0)20 7678 1482 (internal 381482) | Mobile: +44 (0)7968 039848

******************************************************************

NatWest Markets is a marketing name of The Royal Bank of Scotland plc.

This communication and any attachments are confidential and intended solely for the addressee. If you are not the intended recipient please advise us immediately and delete it. Unless specifically stated in the message or otherwise indicated, you may not duplicate, redistribute or forward this message and any attachments are not intended for distribution to, or use by any person or entity in any jurisdiction or country where such distribution or use would be contrary to local law or regulation. The Royal Bank Of Scotland plc or any affiliated entity ("RBS") accepts no responsibility for any changes made to this message after it was sent.

Unless otherwise specifically indicated, the contents of this communication and its attachments are for information purposes only and should not be regarded as an offer or solicitation to buy or sell a product or service, confirmation of any transaction, a valuation, indicative price or an official statement. This communication has been prepared by the RBS trading desk, which may have a position or interest in the products or services mentioned that is inconsistent with any views expressed in this message. In evaluating the information contained in this message, you should know that it could have been previously provided to other clients and/or internal RBS personnel, who could have already acted on it.

RBS cannot provide absolute assurances that all electronic communications (sent or received) are secure, error free, not corrupted, incomplete or virus free and/or that they will not be lost, mis-delivered, destroyed, delayed or intercepted/decrypted by others. Therefore RBS disclaims all liability with regards to electronic communications (and the contents therein) if they are corrupted, lost destroyed, delayed, incomplete, mis-delivered, intercepted, decrypted or otherwise misappropriated by others.

Any electronic communication that is conducted within or through RBS systems will be subject to being archived, monitored and produced to regulators and in litigation in accordance with RBS's policy and local laws, rules and regulations. Unless expressly prohibited by local law, electronic communications may be archived in countries other than the country in which you are located, and may be treated in accordance with the laws and regulations of the country of each individual included in the entire chain.

******************************************************************

Gary Yao-2

Re: Old Flink jobs restarting on Job Manager failover

Hi Steve,

What is the Flink version you are using?

Jobs are recovered from metadata stored in ZooKeeper. The behavior you describe
indicates that the submitted job graph was not deleted from ZooKeeper. By
default, the jobs that should be running/recovered are stored under the znode:

/flink/default/jobgraphs

Can you check whether the job ID is still present in ZK after the cancellation? If
it is, there should be relevant warnings or errors in the
JobManager log that help explain why the deletion failed.
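
As a rough way to run that check, the sketch below lists the children of the job graph znode and reports any job IDs that are not in the set you expect to be running. It is only an illustration under some assumptions: the helper names are hypothetical, the kazoo Python client is a third-party library that is not part of Flink, and the configuration keys and defaults in job_graphs_path reflect my understanding of the ZooKeeper HA options, so please verify them against your flink-conf.yaml.

```python
def job_graphs_path(root="/flink", cluster_id="/default", jobgraphs="/jobgraphs"):
    """Build the job graph znode path.

    Assumed to correspond to high-availability.zookeeper.path.root,
    high-availability.cluster-id and
    high-availability.zookeeper.path.jobgraphs; check your own settings.
    """
    parts = [p.strip("/") for p in (root, cluster_id, jobgraphs)]
    return "/" + "/".join(p for p in parts if p)


def list_job_graph_znodes(hosts, path=None):
    """Return the child znode names (job IDs) under the job graph path.

    Needs the third-party kazoo package; it is imported lazily so the
    rest of this sketch works without it.
    """
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        return zk.get_children(path or job_graphs_path())
    finally:
        zk.stop()


def stale_job_graphs(znode_children, expected_job_ids):
    """Return job IDs that still have a job graph znode but are not expected to run."""
    expected = {job_id.lower() for job_id in expected_job_ids}
    return sorted(c for c in znode_children if c.lower() not in expected)
```

For example, stale_job_graphs(list_job_graph_znodes("zk-host:2181"), running_ids) would return the job graphs that a newly elected JobManager may try to recover, where "zk-host:2181" and running_ids are placeholders for your ZooKeeper quorum and the job IDs you expect to be running.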

Best,
Gary

On Thu, Apr 12, 2018 at 6:01 PM, <[hidden email]> wrote:
