When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

gerardg
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

vino yang
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

gerardg
Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

Andrey Zagrebin
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

gerardg
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

Till Rohrmann
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

vino yang
Hi Till,

You are right, we also saw the problem you said. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point?

Thanks, vino.

2018-07-19 21:58 GMT+08:00 Till Rohrmann <[hidden email]>:
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager 9cf221e2340597629fb932c03aa147[hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

gerardg
Hi Till, 

I can't post the full log (as there is internal info in them) but I've found this. Is that what you are looking for?

11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6, Date:05.06.2018 @ 10:22:30 CEST)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS current user: (...)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Current Hadoop/Kerberos user: <no hadoop dependency found>
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Maximum heap size: 14254 MiBytes
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JAVA_HOME: (not set)
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No Hadoop Dependency available
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM Options:
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Program Arguments:
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     cancel
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     e403893e5208ca47ace886a77e405291
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Classpath: /opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 common frames omitted
11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
11:29:17.479 [main] INFO  o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
11:29:17.489 [main] INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Running 'cancel' command.
11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job e403893e5208ca47ace886a77e405291.
11:29:17.537 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink/default' as Zookeeper namespace.
11:29:17.574 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=(...)
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
11:29:17.579 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
11:29:17.590 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903c3, negotiated timeout = 40000
11:29:17.626 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
11:29:17.766 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:17.812 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
11:29:18.008 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:18.009 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.010 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
11:29:18.030 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903c3 closed
11:29:18.030 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903c3
11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job e403893e5208ca47ace886a77e405291.

Gerard

On Fri, Jul 20, 2018 at 5:14 AM vino yang <[hidden email]> wrote:
Hi Till,

You are right, we also saw the problem you said. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point?

Thanks, vino.

2018-07-19 21:58 GMT+08:00 Till Rohrmann <[hidden email]>:
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

gerardg
We have just started experiencing a different problem that could be related, maybe it helps to diagnose the issue. 

In the last 24h the jobmanager lost connection to Zookeeper a couple of times. Each time, a new jobmanager (in a different node) was elected leader correctly but the taskamangers kept trying to connect to the old jobmanager. These are the ending log messages until the taskamanger shut down itself.

12:06:41.747 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:41.748 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:41.748 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-15 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:06:51.766 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:51.767 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:51.767 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-7 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:07:01.123 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor  - Fatal error occurred in TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.125 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopping TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager removed spill file directory /home/tmp/flink-io-cf83b38e-53f1-4802-a097-e6db95b46084
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.NetworkEnvironment  - Shutting down the network environment and its components.
12:07:01.131 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyClient  - Successful shutdown (took 1 ms).
12:07:01.132 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyServer  - Successful shutdown (took 1 ms).
12:07:01.141 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService  - Stop job leader service.
12:07:01.142 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopped TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB cache
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB cache
12:07:01.148 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x3011955028f00f4 closed
12:07:01.173 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x3011955028f00f4
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopping Akka RPC service.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Shutting down remote daemon.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Remote daemon shut down; proceeding with flushing remote transports.
12:07:01.190 [flink-akka.actor.default-dispatcher-5] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-7 - Remoting shut down.

It seems like there are problems updating Zookeeper. I've have also noticed these messages in Zookeeper log:

 WARN  [SyncThread:2:FileTxnLog@378] - fsync-ing the write ahead log in SyncThread:2 took 1259ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

Maybe Flink hits some timeout?

Gerard

On Mon, Jul 23, 2018 at 11:57 AM Gerard Garcia <[hidden email]> wrote:
Hi Till, 

I can't post the full log (as there is internal info in them) but I've found this. Is that what you are looking for?

11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6, Date:05.06.2018 @ 10:22:30 CEST)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS current user: (...)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Current Hadoop/Kerberos user: <no hadoop dependency found>
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Maximum heap size: 14254 MiBytes
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JAVA_HOME: (not set)
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No Hadoop Dependency available
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM Options:
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Program Arguments:
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     cancel
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     e403893e5208ca47ace886a77e405291
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Classpath: /opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 common frames omitted
11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
11:29:17.479 [main] INFO  o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
11:29:17.489 [main] INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Running 'cancel' command.
11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job e403893e5208ca47ace886a77e405291.
11:29:17.537 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink/default' as Zookeeper namespace.
11:29:17.574 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=(...)
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
11:29:17.579 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
11:29:17.590 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903c3, negotiated timeout = 40000
11:29:17.626 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
11:29:17.766 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:17.812 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
11:29:18.008 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:18.009 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.010 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
11:29:18.030 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903c3 closed
11:29:18.030 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903c3
11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job e403893e5208ca47ace886a77e405291.

Gerard

On Fri, Jul 20, 2018 at 5:14 AM vino yang <[hidden email]> wrote:
Hi Till,

You are right, we also saw the problem you said. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point?

Thanks, vino.

2018-07-19 21:58 GMT+08:00 Till Rohrmann <[hidden email]>:
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

Till Rohrmann
Hi Gerard,

the first log snippet from the client does not show anything suspicious. The warning just says that you cannot use the Yarn CLI because it lacks the Hadoop dependencies in the classpath.

The second snippet is indeed more interesting. If the TaskExecutors are not notified about the changed leader, then this might indicate a problem with the ZooKeeper connection or the ZooKeeper cluster itself. This might also explain why the job deletion from ZooKeeper does not succeed. 

One thing you could check is whether the leader ZNode under `/flink/default/leader/dispatcher_lock` (if you are using the defaults) actually contains the address of the newly elected leader. The leader path should also be logged in the cluster entrypoint logs. You can use the ZooKeeper cli for accessing the ZNodes.

Cheers,
Till

On Mon, Jul 23, 2018 at 4:07 PM Gerard Garcia <[hidden email]> wrote:
We have just started experiencing a different problem that could be related, maybe it helps to diagnose the issue. 

In the last 24h the jobmanager lost connection to Zookeeper a couple of times. Each time, a new jobmanager (in a different node) was elected leader correctly but the taskamangers kept trying to connect to the old jobmanager. These are the ending log messages until the taskamanger shut down itself.

12:06:41.747 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:41.748 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:41.748 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-15 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:06:51.766 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:51.767 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:51.767 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-7 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:07:01.123 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor  - Fatal error occurred in TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.125 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopping TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager removed spill file directory /home/tmp/flink-io-cf83b38e-53f1-4802-a097-e6db95b46084
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.NetworkEnvironment  - Shutting down the network environment and its components.
12:07:01.131 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyClient  - Successful shutdown (took 1 ms).
12:07:01.132 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyServer  - Successful shutdown (took 1 ms).
12:07:01.141 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService  - Stop job leader service.
12:07:01.142 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopped TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB cache
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB cache
12:07:01.148 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x3011955028f00f4 closed
12:07:01.173 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x3011955028f00f4
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopping Akka RPC service.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Shutting down remote daemon.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Remote daemon shut down; proceeding with flushing remote transports.
12:07:01.190 [flink-akka.actor.default-dispatcher-5] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-7 - Remoting shut down.

It seems like there are problems updating Zookeeper. I've have also noticed these messages in Zookeeper log:

 WARN  [SyncThread:2:FileTxnLog@378] - fsync-ing the write ahead log in SyncThread:2 took 1259ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

Maybe Flink hits some timeout?

Gerard

On Mon, Jul 23, 2018 at 11:57 AM Gerard Garcia <[hidden email]> wrote:
Hi Till, 

I can't post the full log (as there is internal info in them) but I've found this. Is that what you are looking for?

11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6, Date:05.06.2018 @ 10:22:30 CEST)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS current user: (...)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Current Hadoop/Kerberos user: <no hadoop dependency found>
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Maximum heap size: 14254 MiBytes
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JAVA_HOME: (not set)
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No Hadoop Dependency available
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM Options:
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Program Arguments:
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     cancel
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     e403893e5208ca47ace886a77e405291
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Classpath: /opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 common frames omitted
11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
11:29:17.479 [main] INFO  o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
11:29:17.489 [main] INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Running 'cancel' command.
11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job e403893e5208ca47ace886a77e405291.
11:29:17.537 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink/default' as Zookeeper namespace.
11:29:17.574 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=(...)
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
11:29:17.579 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
11:29:17.590 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903c3, negotiated timeout = 40000
11:29:17.626 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
11:29:17.766 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:17.812 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
11:29:18.008 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:18.009 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.010 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
11:29:18.030 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903c3 closed
11:29:18.030 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903c3
11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job e403893e5208ca47ace886a77e405291.

Gerard

On Fri, Jul 20, 2018 at 5:14 AM vino yang <[hidden email]> wrote:
Hi Till,

You are right, we also saw the problem you said. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point?

Thanks, vino.

2018-07-19 21:58 GMT+08:00 Till Rohrmann <[hidden email]>:
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

gerardg
Hi Till,

Sorry for the late reply, I was waiting to update to Flink 1.6.0 to see if the problem got fixed but I still experience the first issue (jobgraph not deleted from zookeeper when task is canceled). The second issue (taskmanagers unable to register to the new elected jobmanager) was actually a configuration error, all jobmanagers had the "jobmanager.rpc.address" option set to point to the same jobmanager so I guess the new one was not registering its url correctly in Zookeeper.

And for the first issue, it seems to be a known bug:

so I guess I'll have to wait until there is a fix as I have not seens any workaround (other than removing the jobgraph from Zookeeper after cancelling the task)

Thanks,

Gerard


On Tue, Jul 24, 2018 at 1:46 PM Till Rohrmann <[hidden email]> wrote:
Hi Gerard,

the first log snippet from the client does not show anything suspicious. The warning just says that you cannot use the Yarn CLI because it lacks the Hadoop dependencies in the classpath.

The second snippet is indeed more interesting. If the TaskExecutors are not notified about the changed leader, then this might indicate a problem with the ZooKeeper connection or the ZooKeeper cluster itself. This might also explain why the job deletion from ZooKeeper does not succeed. 

One thing you could check is whether the leader ZNode under `/flink/default/leader/dispatcher_lock` (if you are using the defaults) actually contains the address of the newly elected leader. The leader path should also be logged in the cluster entrypoint logs. You can use the ZooKeeper cli for accessing the ZNodes.

Cheers,
Till

On Mon, Jul 23, 2018 at 4:07 PM Gerard Garcia <[hidden email]> wrote:
We have just started experiencing a different problem that could be related, maybe it helps to diagnose the issue. 

In the last 24h the jobmanager lost connection to Zookeeper a couple of times. Each time, a new jobmanager (in a different node) was elected leader correctly but the taskamangers kept trying to connect to the old jobmanager. These are the ending log messages until the taskamanger shut down itself.

12:06:41.747 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:41.748 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:41.748 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-15 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:06:51.766 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:51.767 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:51.767 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-7 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:07:01.123 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor  - Fatal error occurred in TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.125 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopping TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager removed spill file directory /home/tmp/flink-io-cf83b38e-53f1-4802-a097-e6db95b46084
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.NetworkEnvironment  - Shutting down the network environment and its components.
12:07:01.131 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyClient  - Successful shutdown (took 1 ms).
12:07:01.132 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyServer  - Successful shutdown (took 1 ms).
12:07:01.141 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService  - Stop job leader service.
12:07:01.142 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopped TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB cache
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB cache
12:07:01.148 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x3011955028f00f4 closed
12:07:01.173 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x3011955028f00f4
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopping Akka RPC service.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Shutting down remote daemon.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Remote daemon shut down; proceeding with flushing remote transports.
12:07:01.190 [flink-akka.actor.default-dispatcher-5] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-7 - Remoting shut down.

It seems like there are problems updating Zookeeper. I've have also noticed these messages in Zookeeper log:

 WARN  [SyncThread:2:FileTxnLog@378] - fsync-ing the write ahead log in SyncThread:2 took 1259ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

Maybe Flink hits some timeout?

Gerard

On Mon, Jul 23, 2018 at 11:57 AM Gerard Garcia <[hidden email]> wrote:
Hi Till, 

I can't post the full log (as there is internal info in them) but I've found this. Is that what you are looking for?

11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6, Date:05.06.2018 @ 10:22:30 CEST)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS current user: (...)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Current Hadoop/Kerberos user: <no hadoop dependency found>
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Maximum heap size: 14254 MiBytes
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JAVA_HOME: (not set)
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No Hadoop Dependency available
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM Options:
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Program Arguments:
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     cancel
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     e403893e5208ca47ace886a77e405291
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Classpath: /opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 common frames omitted
11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
11:29:17.479 [main] INFO  o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
11:29:17.489 [main] INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Running 'cancel' command.
11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job e403893e5208ca47ace886a77e405291.
11:29:17.537 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink/default' as Zookeeper namespace.
11:29:17.574 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=(...)
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
11:29:17.579 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
11:29:17.590 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903c3, negotiated timeout = 40000
11:29:17.626 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
11:29:17.766 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:17.812 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
11:29:18.008 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:18.009 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.010 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
11:29:18.030 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903c3 closed
11:29:18.030 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903c3
11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job e403893e5208ca47ace886a77e405291.

Gerard

On Fri, Jul 20, 2018 at 5:14 AM vino yang <[hidden email]> wrote:
Hi Till,

You are right, we also saw the problem you said. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point?

Thanks, vino.

2018-07-19 21:58 GMT+08:00 Till Rohrmann <[hidden email]>:
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



Reply | Threaded
Open this post in threaded view
|

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

Till Rohrmann
Thanks for the update Gerard. Fixing the resource cleanup in the case of standby Dispatchers/JobMasters has a high priority. We will hopefully fix the problem with the next bug fix release. Until then, the JobGraph entry must be removed from ZooKeeper manually.

Cheers,
Till

On Wed, Aug 29, 2018 at 3:17 PM Gerard Garcia <[hidden email]> wrote:
Hi Till,

Sorry for the late reply, I was waiting to update to Flink 1.6.0 to see if the problem got fixed but I still experience the first issue (jobgraph not deleted from zookeeper when task is canceled). The second issue (taskmanagers unable to register to the new elected jobmanager) was actually a configuration error, all jobmanagers had the "jobmanager.rpc.address" option set to point to the same jobmanager so I guess the new one was not registering its url correctly in Zookeeper.

And for the first issue, it seems to be a known bug:

so I guess I'll have to wait until there is a fix as I have not seens any workaround (other than removing the jobgraph from Zookeeper after cancelling the task)

Thanks,

Gerard


On Tue, Jul 24, 2018 at 1:46 PM Till Rohrmann <[hidden email]> wrote:
Hi Gerard,

the first log snippet from the client does not show anything suspicious. The warning just says that you cannot use the Yarn CLI because it lacks the Hadoop dependencies in the classpath.

The second snippet is indeed more interesting. If the TaskExecutors are not notified about the changed leader, then this might indicate a problem with the ZooKeeper connection or the ZooKeeper cluster itself. This might also explain why the job deletion from ZooKeeper does not succeed. 

One thing you could check is whether the leader ZNode under `/flink/default/leader/dispatcher_lock` (if you are using the defaults) actually contains the address of the newly elected leader. The leader path should also be logged in the cluster entrypoint logs. You can use the ZooKeeper cli for accessing the ZNodes.

Cheers,
Till

On Mon, Jul 23, 2018 at 4:07 PM Gerard Garcia <[hidden email]> wrote:
We have just started experiencing a different problem that could be related, maybe it helps to diagnose the issue. 

In the last 24h the jobmanager lost connection to Zookeeper a couple of times. Each time, a new jobmanager (in a different node) was elected leader correctly but the taskamangers kept trying to connect to the old jobmanager. These are the ending log messages until the taskamanger shut down itself.

12:06:41.747 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:41.748 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:41.748 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-15 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:06:51.766 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: (...)1/192.168.1.9:35605
12:06:51.767 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@(...)1:35605/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@(...)1:35605/user/resourcemanager..
12:06:51.767 [flink-akka.actor.default-dispatcher-5] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-7 - Association with remote system [akka.tcp://flink@(...)1:35605] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@(...)1:35605]] Caused by: [Connection refused: (...)1/192.168.1.9:35605]
12:07:01.123 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor  - Fatal error occurred in TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.125 [flink-akka.actor.default-dispatcher-5] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1018)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1004)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopping TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
12:07:01.128 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager removed spill file directory /home/tmp/flink-io-cf83b38e-53f1-4802-a097-e6db95b46084
12:07:01.130 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.NetworkEnvironment  - Shutting down the network environment and its components.
12:07:01.131 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyClient  - Successful shutdown (took 1 ms).
12:07:01.132 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.io.network.netty.NettyServer  - Successful shutdown (took 1 ms).
12:07:01.141 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService  - Stop job leader service.
12:07:01.142 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Stopped TaskExecutor akka.tcp://flink@(...)2:33455/user/taskmanager_0.
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.PermanentBlobCache  - Shutting down BLOB cache
12:07:01.143 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.blob.TransientBlobCache  - Shutting down BLOB cache
12:07:01.148 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x3011955028f00f4 closed
12:07:01.173 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x3011955028f00f4
12:07:01.173 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopping Akka RPC service.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Shutting down remote daemon.
12:07:01.178 [flink-akka.actor.default-dispatcher-3] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-16 - Remote daemon shut down; proceeding with flushing remote transports.
12:07:01.190 [flink-akka.actor.default-dispatcher-5] INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-7 - Remoting shut down.

It seems like there are problems updating Zookeeper. I've have also noticed these messages in Zookeeper log:

 WARN  [SyncThread:2:FileTxnLog@378] - fsync-ing the write ahead log in SyncThread:2 took 1259ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

Maybe Flink hits some timeout?

Gerard

On Mon, Jul 23, 2018 at 11:57 AM Gerard Garcia <[hidden email]> wrote:
Hi Till, 

I can't post the full log (as there is internal info in them) but I've found this. Is that what you are looking for?

11:29:17.351 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Starting Command Line Client (Version: 1.5-SNAPSHOT, Rev:a4fc4c6, Date:05.06.2018 @ 10:22:30 CEST)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  OS current user: (...)
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Current Hadoop/Kerberos user: <no hadoop dependency found>
11:29:17.372 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Maximum heap size: 14254 MiBytes
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JAVA_HOME: (not set)
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  No Hadoop Dependency available
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  JVM Options:
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog.file=/opt/flink/flink-1.5.0/log/flink-root-client-(...).log
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlog4j.configuration=file:/opt/flink/flink-1.5.0/conf/log4j-cli.properties
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     -Dlogback.configurationFile=file:/opt/flink/flink-1.5.0/conf/logback.xml
11:29:17.373 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Program Arguments:
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     cancel
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -     e403893e5208ca47ace886a77e405291
11:29:17.374 [main] INFO  org.apache.flink.client.cli.CliFrontend  -  Classpath: /opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.375 [main] INFO  org.apache.flink.client.cli.CliFrontend  - --------------------------------------------------------------------------------
11:29:17.380 [main] WARN  org.apache.flink.client.cli.CliFrontend  - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1204)
        at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1160)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1086)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 common frames omitted
11:29:17.385 [main] INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
11:29:17.479 [main] INFO  o.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
11:29:17.489 [main] INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
11:29:17.518 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Running 'cancel' command.
11:29:17.523 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job e403893e5208ca47ace886a77e405291.
11:29:17.537 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
11:29:17.538 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
11:29:17.539 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink/default' as Zookeeper namespace.
11:29:17.574 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
11:29:17.577 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=(...)
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
11:29:17.578 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
11:29:17.579 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
11:29:17.580 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=(...)
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
11:29:17.581 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
11:29:17.589 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3807415919448894740.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
11:29:17.590 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
11:29:17.590 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
11:29:17.603 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
11:29:17.625 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903c3, negotiated timeout = 40000
11:29:17.626 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
11:29:17.764 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
11:29:17.766 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:17.812 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.007 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
11:29:18.008 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
11:29:18.008 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
11:29:18.009 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
11:29:18.010 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
11:29:18.030 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903c3 closed
11:29:18.030 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903c3
11:29:18.030 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job e403893e5208ca47ace886a77e405291.

Gerard

On Fri, Jul 20, 2018 at 5:14 AM vino yang <[hidden email]> wrote:
Hi Till,

You are right, we also saw the problem you said. Curator removes the specific job graph path asynchronously. But it's the only gist when recovering, right? Is there any plan to enhance this point?

Thanks, vino.

2018-07-19 21:58 GMT+08:00 Till Rohrmann <[hidden email]>:
Hi Gerard,

the logging statement `Removed job graph ... from ZooKeeper` is actually not 100% accurate. The actual deletion is executed as an asynchronous background task and the log statement is not printed in the callback (which it should). Therefore, the deletion could still have failed. In order to see this, the full jobmanager/cluster entry point logs would be tremendously helpful.

Cheers,
Till

On Thu, Jul 19, 2018 at 1:33 PM Gerard Garcia <[hidden email]> wrote:
Thanks Andrey,

That is the log from the jobmanager just after it has finished cancelling the task:

11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job e403893e5208ca47ace886a77e405291.
11:29:18.716 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
11:29:18.738 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink-eur/default/checkpoints/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
11:29:18.780 [flink-akka.actor.default-dispatcher-15695] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/e403893e5208ca47ace886a77e405291 from ZooKeeper
11:29:18.827 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job e403893e5208ca47ace886a77e405291 reached globally terminal state CANCELED.
11:29:18.846 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the JobMaster for job (...)(e403893e5208ca47ace886a77e405291).
11:29:18.848 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  org.apache.flink.runtime.jobmaster.JobMaster  - Close ResourceManager connection d5fbc30a895066054e29fb2fd60fb0f1: JobManager is shutting down..
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping SlotPool.
11:29:18.864 [flink-akka.actor.default-dispatcher-15688] INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@(...):33827/user/jobmanager_9 for job e403893e5208ca47ace886a77e405291 from the resource manager.
11:29:18.864 [flink-akka.actor.default-dispatcher-15675] INFO  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/e403893e5208ca47ace886a77e405291/job_manager_lock'}.
11:29:18.980 [flink-akka.actor.default-dispatcher-15695] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 31154 for job 5d8c376b10d358b9c9470b3e70113626 (132520 bytes in 411 ms).
11:29:19.025 [flink-akka.actor.default-dispatcher-15683] INFO  o.a.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper.


At the end it says removed job graph e403893e5208ca47ace886a77e405291 from ZooKeeper but I still can see it at /flink/default/jobgraphs:

[zk: localhost:2181(CONNECTED) 14] ls /flink/default/jobgraphs/e403893e5208ca47ace886a77e405291
[3fe9c3c8-5bec-404e-a720-75f9b188124f, 36208299-0f6d-462c-bae4-2e3d53f50e8c]

Gerard

On Wed, Jul 18, 2018 at 4:24 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Gerard,

There is an issue recently fixed for 1.5.2, 1.6.0:
It might have caused your problem.

Can you please provide log from JobManager/Entry point for further investigation?

Cheers,
Andrey

On 18 Jul 2018, at 10:16, Gerard Garcia <[hidden email]> wrote:

Hi vino,

Seems that jobs id stay in /jobgraphs when we cancel them manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09 the entry is still in zookeeper's path /flink/default/jobgraphs, but the job disappeared from /home/nas/flink/ha/default/blob/.

That is the client log:

09:20:58.492 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO  org.apache.flink.runtime.blob.FileSystemBlobStore  - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO  org.apache.flink.runtime.util.ZooKeeperUtils  - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - Starting
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=
3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/root
09:20:58.547 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO  o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
09:20:58.711 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest client endpoint started.
09:20:58.713 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
09:20:58.946 [main] INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO  o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO  org.apache.flink.client.cli.CliFrontend  - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO  o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that in /jobgraphs there should only be the job ids that are currently running (at least it seemed that when the jobmanager restarted it tried to restart the jobs ids stored there). Is that correct?

Gerard 

On Wed, Jul 18, 2018 at 9:17 AM vino yang <[hidden email]> wrote:
Hi Gerard,

From you provide information, you mean the path in Zookeeper "/jobgraphs" exists more jobs than you submitted? 
And can not be restarted because blob files can not be find?

Can you provide more details, about the stack trace, log and which version of Flink? Normally, the jobgraph can not be added to Zookeeper except submit job manually. 

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <[hidden email]>:
Hi,

Our deployment consists of a standalone HA cluster of 8 machines with an
external Zookeeper cluster. We have observed several times that when a
jobmanager fails and a new one is elected, the new one tries to restart
more jobs than the ones that were running and since it can't find some
files, it fails and gets stuck in a restart loop. That is the error that we
see in the logs:



These are the contents of /home/nas/flink/ha/default/blob/:



We've checked zookeeper and there are actually a lot of jobgraphs in
/flink/default/jobgraphs



There were only three jobs running so neither zookeeper nor the flink 'ha'
folder seems to have the correct number of jobgraphs stored.

The only way we have to solve this is to remove everything at path /flink in
zookeeper and the 'ha' flink folder and restart the jobs manually.

I'll try to monitor if some action (e.g. we have been canceling and
restoring jobs from savepoints quite often lately) leaves an entry in
zookeepers path /flink/default/jobgraphs of a job that is not running but
maybe someone can't point us to some configuration problem that could cause
this behavior.

Thanks,

Gerard





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/