JobGraphs not cleaned up in HA mode

Encho Mishinev
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left behind by jobmanager-1. As a result, job graphs remain in both HDFS and Zookeeper. They later obstruct both managers: after any restart, each one tries to restore the non-existent job and fails over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
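The three Zookeeper log lines above share one format, so the failing operation, path and error code can be pulled out mechanically when there are many of them. The following is only a sketch: the regular expression is derived from the lines quoted in this post, and the function name `parse_keeper_errors` is made up for illustration.

```python
import re

# Pattern derived from the "Got user-level KeeperException" lines quoted above.
# It is an assumption that other ZooKeeper versions log in the same shape.
KEEPER_LINE = re.compile(
    r"type:(?P<op>\w+) .*?"
    r"Error Path:(?P<path>\S+) "
    r"Error:KeeperErrorCode = (?P<code>.+?) for "
)


def parse_keeper_errors(lines):
    """Return (operation, path, error code) for every matching log line."""
    found = []
    for line in lines:
        match = KEEPER_LINE.search(line)
        if match:
            found.append((match["op"], match["path"], match["code"]))
    return found


# Two of the lines from this post, copied verbatim.
sample = [
    "2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77",
    "2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15",
]

errors = parse_keeper_errors(sample)
for op, path, code in errors:
    print(op, code, path)
```

Grouping the result by path makes it easy to see which job IDs hit a failed delete.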

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs in jobmanager-2 seem fine; it doesn't seem to be aware that it's overwriting anything or failing to delete properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

vino yang
Hi Encho,

This problem is already known to the Flink community. You can track its progress in FLINK-10011[1]; Till is currently working on a fix.


Thanks, vino.

Re: JobGraphs not cleaned up in HA mode

vino yang
Some background on the implementation:
Flink stores the JobGraph (the job's description and metadata) in Zookeeper as the basis for job recovery.
However, in the existing implementation this information may not be cleaned up properly, because it is deleted asynchronously by a background thread.

Thanks, vino.

Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

Re: JobGraphs not cleaned up in HA mode

vino yang
Hi Encho,

As a temporary workaround, you can monitor the specific JobID under Zookeeper's "/jobgraph" path to determine whether it has been cleaned up.
Another option is to modify the source code and bluntly switch the cleanup to a synchronous form. However, Flink needs to obtain the corresponding lock when it operates on a Zookeeper path, so this is dangerous and not recommended.
I think this problem may be solved in the next version. It depends on Till.
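A rough sketch of the monitoring idea, under several assumptions: the root path `/flink/default/jobgraphs` is taken from the Zookeeper logs earlier in this thread and may differ in other setups, the function name `leftover_job_ids` is made up, and `zk` is any client object with `exists(path)` and `get_children(path)` methods (the kazoo library's `KazooClient` has methods of this shape, but the thread does not name a client). An in-memory stand-in is used here so the sketch runs without a Zookeeper server.

```python
def leftover_job_ids(zk, finished_job_ids, root="/flink/default/jobgraphs"):
    """Return the finished jobs whose job graph node is still present.

    `zk` only needs `exists(path)` and `get_children(path)`.
    """
    if not zk.exists(root):
        return []
    present = set(zk.get_children(root))
    return sorted(job_id for job_id in finished_job_ids if job_id in present)


class InMemoryZk:
    """Hypothetical stand-in for a real client, for demonstration only."""

    def __init__(self, children_by_path):
        self._children = children_by_path

    def exists(self, path):
        return path in self._children

    def get_children(self, path):
        return list(self._children[path])


# Job IDs copied from the logs in this thread; the first one was left behind.
zk = InMemoryZk({"/flink/default/jobgraphs": ["83bfa359ca59ce1d4635e18e16651e15"]})
finished = [
    "83bfa359ca59ce1d4635e18e16651e15",
    "9e0a109b57511930c95d3b54574a66e3",
]
leftovers = leftover_job_ids(zk, finished)
print(leftovers)
```

Run periodically against the real ensemble, a check like this would only report leftovers; it does not delete anything, which avoids interfering with the locks mentioned above.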

Thanks, vino.

Re: JobGraphs not cleaned up in HA mode

Till Rohrmann
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
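To illustrate the mechanism described above, here is a toy in-memory model. It is not Flink or ZooKeeper code: the class `ToyZk`, its methods and the lock names `lock-jm1` / `lock-jm2` are invented for this sketch. It only captures two rules: a node with children cannot be deleted, and a session's ephemeral children disappear once that session expires.

```python
class ToyZk:
    """Toy model of ephemeral lock nodes under a job graph node."""

    def __init__(self):
        # parent path -> {child name: id of the session that owns it}
        self.children = {}

    def create_ephemeral(self, parent, name, session):
        self.children.setdefault(parent, {})[name] = session

    def release(self, parent, name):
        self.children.get(parent, {}).pop(name, None)

    def expire_session(self, session):
        # Drop every ephemeral child owned by the expired session.
        for kids in self.children.values():
            for name in [n for n, owner in kids.items() if owner == session]:
                del kids[name]

    def delete(self, path):
        if self.children.get(path):
            return False  # comparable to "Directory not empty"
        self.children.pop(path, None)
        return True


zk = ToyZk()
job = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
zk.create_ephemeral(job, "lock-jm1", session="jm1")  # old leader
zk.create_ephemeral(job, "lock-jm2", session="jm2")  # new leader

# The new leader stops the job: it releases its own lock and tries to delete.
zk.release(job, "lock-jm2")
blocked = zk.delete(job)   # old leader's session is still alive

# Once the old session has expired, the same delete goes through.
zk.expire_session("jm1")
cleaned = zk.delete(job)
print(blocked, cleaned)
```

In this model the outcome depends only on whether the old session has expired before the delete is attempted, which is why the timing of the actions matters.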

Cheers,
Till

Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Hello Till,

I spent a few more hours testing and looking at the logs, and it seems like there's a more general problem here. While the two job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes restarts JobManager 1 quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

Re: JobGraphs not cleaned up in HA mode

vino yang
Hi Encho,

From your description, I suspect there are additional bugs.

About your description:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

Does it happen every time?

In a standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <[hidden email]> wrote:
Hello Till,

I spent a few more hours testing and looking at the logs, and it seems there is a more general problem here. While both job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes restarts JobManager 1 quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
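
To make the mechanism described above concrete, here is a toy in-memory model. It is not Flink or ZooKeeper code. It only encodes two ZooKeeper rules: a znode with children cannot be deleted, and an ephemeral znode disappears only when its session ends. The class `ToyZooKeeper`, the lock names and the session ids are hypothetical; only the job path is taken from the logs in this thread.

```python
class NotEmptyError(Exception):
    """Raised when deleting a znode that still has children."""


class ToyZooKeeper:
    """In-memory stand-in for the two ZooKeeper rules that matter here."""

    def __init__(self):
        # path -> owning session id, or None for a persistent znode
        self.nodes = {}

    def create(self, path, ephemeral_owner=None):
        self.nodes[path] = ephemeral_owner

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return [p for p in self.nodes if p.startswith(prefix)]

    def delete(self, path):
        # Rule 1: a znode that still has children cannot be deleted.
        if self.children(path):
            raise NotEmptyError(path)
        del self.nodes[path]

    def expire_session(self, session):
        # Rule 2: ephemeral znodes vanish only when their session ends.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}


def try_cleanup(zk, job_path):
    """Return True if the job graph znode could be removed."""
    try:
        zk.delete(job_path)
        return True
    except NotEmptyError:
        return False


zk = ToyZooKeeper()
job = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
zk.create(job)
zk.create(job + "/lock-jm-1", ephemeral_owner="session-jm-1")
zk.create(job + "/lock-jm-2", ephemeral_owner="session-jm-2")

zk.delete(job + "/lock-jm-2")      # the new leader releases its own lock
print(try_cleanup(zk, job))        # False: the old JM's lock is still present
zk.expire_session("session-jm-1")  # old JM disconnects or its session times out
print(try_cleanup(zk, job))        # True
```

In this model the cleanup can only succeed once the other job manager's session is gone, which is why the connection state of the old JM and the session timeout are the first things to check.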

Cheers,
Till

On Tue, Aug 28, 2018 at 8:17 AM vino yang <[hidden email]> wrote:
Hi Encho,

As a temporary measure, you can check whether a job has been cleaned up by monitoring its JobID under the "jobgraphs" path in ZooKeeper.
Another option would be to modify the source code so that the cleanup runs synchronously. However, Flink has to obtain the corresponding lock before it operates on a ZooKeeper path, so this is dangerous and not recommended.
I think this problem may be solved in the next version. It depends on Till.
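
The monitoring check suggested above could be sketched roughly as follows. This is only an illustration: the root path `/flink/default/jobgraphs` is taken from the ZooKeeper logs in this thread, the function names are made up, and `check_cluster` assumes the third-party `kazoo` client is installed and that the quorum is reachable at `zk-cs:2181`.

```python
def leftover_job_graphs(client, root="/flink/default/jobgraphs"):
    """Map each job ID still registered under `root` to its child znodes.

    `client` is any object offering kazoo-style `exists(path)` and
    `get_children(path)` methods.
    """
    if not client.exists(root):
        return {}
    return {
        job_id: sorted(client.get_children(f"{root}/{job_id}"))
        for job_id in client.get_children(root)
    }


def check_cluster(hosts="zk-cs:2181"):
    """Print leftover job graphs; assumes the `kazoo` package is available."""
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        for job_id, children in leftover_job_graphs(zk).items():
            print(job_id, children)
    finally:
        zk.stop()
```

If a JobID is still listed some time after the job has finished or been cancelled, its child znodes show which locks are still being held.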

Thanks, vino.

On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <[hidden email]> wrote:
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

On Tue, Aug 28, 2018 at 4:52 AM vino yang <[hidden email]> wrote:
Some background on the implementation:
Flink stores the JobGraph (the job's description and metadata) in ZooKeeper as the basis for job recovery.
However, in earlier implementations this information may not be cleaned up properly, because it is deleted asynchronously by a background thread.

Thanks, vino.

On Tue, Aug 28, 2018 at 9:49 AM vino yang <[hidden email]> wrote:
Hi Encho,

This problem is already known to the Flink community. You can track its progress through FLINK-10011[1]; Till is currently working on a fix.


Thanks, vino.

On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <[hidden email]> wrote:
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left from jobmanager-1. This means that both in HDFS and in Zookeeper remain job graphs, which later on obstruct any work of both managers as after any reset they unsuccessfully try to restore a non-existent job and fail over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs in jobmanager-2 seem fine; it doesn’t seem to be aware that it’s overwriting anything or failing to delete properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Hi,

Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. 

Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.

Thanks,
Encho


Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Hello,

Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in a YARN cluster but does not specify such an option for Kubernetes. Is there anything that can go wrong with this approach?

Thanks


Re: JobGraphs not cleaned up in HA mode

Till Rohrmann
Hi Encho,

it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?

Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick and there is nothing wrong with it.
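
As a rough illustration of the single-JobManager setup, the sketch below builds a Kubernetes Deployment manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. Everything specific in it is an assumption rather than a tested configuration: the Deployment name, the labels, the image tag `flink:1.5.3` and the `jobmanager` argument. The ports are the ones from the configuration posted in this thread, and the HA settings still have to be supplied separately in the Flink configuration.

```python
import json


def jobmanager_deployment(name="flink-jobmanager", image="flink:1.5.3"):
    """Build a Deployment manifest that runs exactly one JobManager pod."""
    labels = {"app": "flink", "component": "jobmanager"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # A single replica: Kubernetes restarts the pod when it fails.
            "replicas": 1,
            # Recreate stops the old pod before starting a new one, so two
            # JobManagers do not overlap during an update.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "jobmanager",
                            "image": image,
                            "args": ["jobmanager"],
                            "ports": [
                                {"containerPort": 6123, "name": "rpc"},
                                {"containerPort": 6124, "name": "blob"},
                                {"containerPort": 8081, "name": "rest"},
                                {"containerPort": 50010, "name": "ha"},
                            ],
                        }
                    ]
                },
            },
        },
    }


print(json.dumps(jobmanager_deployment(), indent=2))
```

Saved under a hypothetical name such as `jm_manifest.py`, the output could be piped with `python jm_manifest.py | kubectl apply -f -`. Recovery after a restart then relies on the job graphs and checkpoints kept in ZooKeeper and HDFS rather than on a standby JobManager.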

Cheers,
Till


Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Hi Till,

I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment; I hope they help with debugging:

- Starting Jobmanager-1:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:49,134 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:49,215 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-1
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:49,250 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:49,282 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:49,298 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:49,309 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-1:50010
2018-08-29 11:41:49,768 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:49,823 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:49,974 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-1:50010]
2018-08-29 11:41:49,981 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-1:50010
2018-08-29 11:41:50,444 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:50,568 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:50,605 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a
2018-08-29 11:41:50,605 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:50,607 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:50,609 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:50,618 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = 40000
2018-08-29 11:41:50,619 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:50,627 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184
2018-08-29 11:41:50,674 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:50,675 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:50,676 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:50,679 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-1:8081
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:51,091 WARN  org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths  - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
2018-08-29 11:41:51,101 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-1:8081.
2018-08-29 11:41:51,114 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:51,141 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://flink-jobmanager-1:8081 was granted leadership with leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd
2018-08-29 11:41:51,214 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:51,230 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:51,232 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:51,234 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:51,235 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 11:41:51,253 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,254 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-29 11:41:51,468 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.
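
As a side note on the config dump printed at the top of this log: a few keys (`taskmanager.numberOfTaskSlots`, `blob.server.port`, `query.server.port`) appear twice, each time with the same value. The sketch below is one way to spot such repeats in a flat `key: value` file. The helper name `duplicate_keys` is hypothetical, and it assumes one `key: value` pair per line as in the dump above, not full YAML.

```python
from collections import defaultdict


def duplicate_keys(config_text):
    """Map each key that appears more than once to the list of its values."""
    seen = defaultdict(list)
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a "key: value" pair.
        if not line or line.startswith("#") or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        seen[key].append(value.strip())
    return {key: values for key, values in seen.items() if len(values) > 1}


# Excerpt of the config dump from this log.
sample = """\
taskmanager.numberOfTaskSlots: 4
rest.port: 8081
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
"""

for key, values in duplicate_keys(sample).items():
    print(key, values)
```

Since the repeated values are identical here, the duplicates show no conflict, but the same check would flag a key that is set to two different values.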

- Starting Jobmanager-2:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9.
2018-08-29 11:41:51,688 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:52,018 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:52,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,122 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:52,123 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:52,133 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:52,173 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:52,188 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:52,198 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 11:41:52,753 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:52,822 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:53,038 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 11:41:53,046 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 11:41:53,500 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:53,558 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:53,559 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:53,616 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:53,624 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:53,644 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:53,646 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:53,646 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:53,647 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:53,649 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = 40000
2018-08-29 11:41:53,656 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:53,667 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

Upon submitting a batch job on Jobmanager-1, we immediately get this log on Jobmanager-2:
2018-08-29 11:47:06,249 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

Meanwhile Jobmanager-1 gets:
-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)

2018-08-29 11:47:06,006 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting job d69b67e4d28a2d244b06d3f6d661bca1 (sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.

-loads of job execution info-

2018-08-29 11:49:20,272 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.


-------------------

The result is:
HDFS contains only a jobgraph and an empty default folder; everything else is cleared.
ZooKeeper still contains the jobgraph that Jobmanager-1 claims, in its last log line above, to have removed.
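
To make the correlation above easier to repeat, here is a minimal sketch that scans the logs of both job managers for the three ZooKeeperSubmittedJobGraphStore messages quoted in this thread and reports job IDs that one manager added and a different manager recovered. The function names and the "jobmanager-1"/"jobmanager-2" labels are my own choices for illustration, not anything Flink provides.

```python
import re
from collections import defaultdict

# Matches the "Added", "Recovered" and "Removed" messages quoted in this thread.
EVENT_RE = re.compile(
    r"ZooKeeperSubmittedJobGraphStore\s+- "
    r"(?:(Added|Recovered) SubmittedJobGraph\(([0-9a-f]{32}), [^)]*\)"
    r"|(Removed) job graph ([0-9a-f]{32}))"
)


def job_graph_events(logs):
    """Map job ID -> list of (manager label, event) in the order read.

    `logs` maps a label of our choosing to that process's log lines.
    """
    events = defaultdict(list)
    for manager, lines in logs.items():
        for line in lines:
            m = EVENT_RE.search(line)
            if m:
                event = m.group(1) or m.group(3)
                job_id = m.group(2) or m.group(4)
                events[job_id].append((manager, event))
    return dict(events)


def recovered_elsewhere(events):
    """Job IDs added by one manager and recovered by a different one."""
    suspicious = []
    for job_id, seen in events.items():
        adders = {m for m, e in seen if e == "Added"}
        recoverers = {m for m, e in seen if e == "Recovered"}
        if adders and recoverers - adders:
            suspicious.append(job_id)
    return sorted(suspicious)


# Log lines copied from this thread.
SAMPLE = {
    "jobmanager-1": [
        "2018-08-29 11:47:06,090 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.",
        "2018-08-29 11:49:20,304 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.",
    ],
    "jobmanager-2": [
        "2018-08-29 11:47:06,249 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).",
    ],
}

print(recovered_elsewhere(job_graph_events(SAMPLE)))
```

A job ID showing up in that list only says that a second process read the job graph; it does not by itself prove that the second process holds a lock on it.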

On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?

Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick and there is nothing wrong with it.

Cheers,
Till

On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <[hidden email]> wrote:
Hello,

Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in a YARN cluster but does not specify such an option for Kubernetes. Is there anything that can go wrong with this approach?

Thanks

On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <[hidden email]> wrote:
Hi,

Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. 

Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.

Thanks,
Encho

On Wed, Aug 29, 2018 at 5:15 AM vino yang <[hidden email]> wrote:
Hi Encho,

From your description, it seems there may be additional bugs.

About your description:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

Does this happen every time?

In the Standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <[hidden email]> wrote:
Hello Till,

I spent a few more hours testing and looking at the logs, and it seems there is a more general problem here. While both job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes restarts JobManager 1 quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
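
The mechanism described above can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: ToyZooKeeper, release_and_remove and the "lock-<session>" child names are hypothetical and are not Flink or ZooKeeper code. The model keeps just two ZooKeeper rules: a node that still has children cannot be deleted, and an ephemeral node disappears only when the session that created it ends.

```python
class NotEmptyError(Exception):
    """Stands in for ZooKeeper's 'Directory not empty' error."""


class ToyZooKeeper:
    """In-memory stand-in that models only node deletion and ephemeral owners."""

    def __init__(self):
        # path -> owning session id for ephemeral nodes, None for persistent ones
        self.nodes = {}

    def create(self, path, ephemeral_owner=None):
        self.nodes[path] = ephemeral_owner

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return [p for p in self.nodes if p.startswith(prefix)]

    def delete(self, path):
        if self.children(path):
            raise NotEmptyError(path)
        del self.nodes[path]

    def close_session(self, session_id):
        # Ephemeral nodes vanish only once their session is closed or expires.
        for path in [p for p, owner in self.nodes.items() if owner == session_id]:
            del self.nodes[path]


def release_and_remove(zk, path, session_id):
    """Drop our own lock child, then try to delete the job graph node."""
    zk.nodes.pop(f"{path}/lock-{session_id}", None)
    try:
        zk.delete(path)
        return True
    except NotEmptyError:
        return False


zk = ToyZooKeeper()
graph = "/flink/default/jobgraphs/job-a"  # hypothetical job ID
zk.create(graph)
zk.create(f"{graph}/lock-jm1", ephemeral_owner="jm1")  # lock of the leader
zk.create(f"{graph}/lock-jm2", ephemeral_owner="jm2")  # lock of a second live session

first_attempt = release_and_remove(zk, graph, "jm1")   # jm2's lock is still there
zk.close_session("jm2")                                # jm2's session ends
second_attempt = release_and_remove(zk, graph, "jm1")  # nothing blocks the delete now
print(first_attempt, second_attempt)
```

In this model the first delete fails for as long as the other session is alive, which matches the "Directory not empty" error in the ZooKeeper log quoted in this thread.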

Cheers,
Till

On Tue, Aug 28, 2018 at 8:17 AM vino yang <[hidden email]> wrote:
Hi Encho,

As a temporary measure, you can determine whether a job graph has been cleaned up by monitoring the specific JobID under ZooKeeper's "/jobgraph" path.
Another option is to modify the source code and force the cleanup to run synchronously. However, Flink has to obtain the corresponding lock before it operates on a ZooKeeper path, so this is dangerous and not recommended.
I think this problem may be solved in the next version. It depends on Till.
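
The monitoring idea mentioned above could look roughly like the following sketch. It is deliberately independent of any ZooKeeper client: list_children is a hypothetical callable that you would back with your client's "get children" call on the job graph path, and wait_for_cleanup is a name made up for this example.

```python
import time


def wait_for_cleanup(list_children, job_id, timeout_s=30.0, poll_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until job_id no longer appears among the job graph children.

    list_children: hypothetical zero-argument callable returning the child
    names of the job graph znode.
    Returns True if the entry disappeared before the timeout, else False.
    """
    deadline = clock() + timeout_s
    while True:
        if job_id not in list_children():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A False result means the entry is still present after the timeout, which is the situation described in this thread; what to do about it (for example, manual removal) is a separate decision.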

Thanks, vino.

On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <[hidden email]> wrote:
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

On Tue, Aug 28, 2018 at 4:52 AM vino yang <[hidden email]> wrote:
Some background on the implementation:
Flink uses ZooKeeper to store the JobGraph (the job's description and metadata) as the basis for job recovery.
However, in previous implementations this information may not be cleaned up properly because it is deleted asynchronously by a background thread.

Thanks, vino.

On Tue, Aug 28, 2018 at 9:49 AM vino yang <[hidden email]> wrote:
Hi Encho,

This problem is already known to the Flink community. You can track its progress through FLINK-10011[1]; Till is currently fixing it.


Thanks, vino.

On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <[hidden email]> wrote:
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left from jobmanager-1. This means that both in HDFS and in Zookeeper remain job graphs, which later on obstruct any work of both managers as after any reset they unsuccessfully try to restore a non-existent job and fail over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
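
For anyone who wants to pull the same information out of a longer ZooKeeper server log, here is a small sketch that extracts the operation, error code and path from "user-level KeeperException" lines like the three above. The regular expression is fitted only to the line format quoted in this thread, and the function name is made up; other ZooKeeper versions may log differently.

```python
import re
from collections import Counter

# Fitted to the three ZooKeeper server log lines quoted in this thread.
KEEPER_RE = re.compile(
    r"type:(?P<op>\w+) .*?Error Path:(?P<path>\S+) "
    r"Error:KeeperErrorCode = (?P<code>.+?) for "
)


def keeper_errors(lines):
    """Return (operation, error code, path) for each matching log line."""
    found = []
    for line in lines:
        m = KEEPER_RE.search(line)
        if m:
            found.append((m.group("op"), m.group("code"), m.group("path")))
    return found


JOB = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
COUNTER = "/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15"
LOCK = JOB + "/bbb259fd-7826-4950-bc7c-c2be23346c77"

# The lines above, with the timestamp and thread prefix trimmed.
SAMPLE = [
    "Got user-level KeeperException when processing sessionid:0x1657aa15e480033 "
    "type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a "
    f"Error Path:{LOCK} Error:KeeperErrorCode = NodeExists for {LOCK}",
    "Got user-level KeeperException when processing sessionid:0x1657aa15e480033 "
    "type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a "
    f"Error Path:{COUNTER} Error:KeeperErrorCode = NodeExists for {COUNTER}",
    "Got user-level KeeperException when processing sessionid:0x1657aa15e480033 "
    "type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a "
    f"Error Path:{JOB} Error:KeeperErrorCode = Directory not empty for {JOB}",
]

summary = Counter((op, code) for op, code, _ in keeper_errors(SAMPLE))
print(summary)
```

Grouping by operation and error code makes it easy to see whether failed deletes ("Directory not empty") line up with the job IDs that were left behind.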

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs seem fine in jobmanager-2, it doesn’t seem to be aware that it’s overwriting anything or not deleting properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

Till Rohrmann
Hi Encho,

thanks for sending the first part of the logs. What I would actually be interested in are the complete logs, because somewhere in the jobmanager-2 logs there must be a log statement saying that the respective dispatcher gained leadership. I would like to see why this happens, but the complete logs are necessary to debug it. It would be awesome if you could send them to me. Thanks a lot!

Cheers,
Till

On Wed, Aug 29, 2018 at 2:00 PM Encho Mishinev <[hidden email]> wrote:
Hi Till,

I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment, hope they help in debugging:

- Starting Jobmanager-1:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:49,134 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:49,215 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-1
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:49,250 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:49,282 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:49,298 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:49,309 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-1:50010
2018-08-29 11:41:49,768 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:49,823 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:49,974 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-1:50010]
2018-08-29 11:41:49,981 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-1:50010
2018-08-29 11:41:50,444 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:50,568 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:50,605 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a
2018-08-29 11:41:50,605 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:50,607 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:50,609 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:50,618 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = 40000
2018-08-29 11:41:50,619 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:50,627 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184
2018-08-29 11:41:50,674 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:50,675 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:50,676 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:50,679 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-1:8081
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:51,091 WARN  org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths  - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
2018-08-29 11:41:51,101 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-1:8081.
2018-08-29 11:41:51,114 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:51,141 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://flink-jobmanager-1:8081 was granted leadership with leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd
2018-08-29 11:41:51,214 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:51,230 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:51,232 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:51,234 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:51,235 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 11:41:51,253 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,254 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-29 11:41:51,468 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.

Starting Jobmanager-2:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9.
2018-08-29 11:41:51,688 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:52,018 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:52,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,122 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:52,123 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:52,133 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:52,173 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:52,188 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:52,198 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 11:41:52,753 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:52,822 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:53,038 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 11:41:53,046 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 11:41:53,500 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:53,558 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:53,559 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:53,616 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:53,624 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:53,644 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:53,646 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:53,646 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:53,647 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:53,649 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = 40000
2018-08-29 11:41:53,656 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:53,667 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

Upon submitting a batch job on Jobmanager-1, we immediately get this log on Jobmanager-2:
2018-08-29 11:47:06,249 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

Meanwhile Jobmanager-1 gets:
-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)

2018-08-29 11:47:06,006 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting job d69b67e4d28a2d244b06d3f6d661bca1 (sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.

-loads of job execution info-

2018-08-29 11:49:20,272 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.


-------------------

The result is:
HDFS contains only a jobgraph and an empty default folder - everything else is cleared.
ZooKeeper still contains the jobgraph that Jobmanager-1 claims, in its last log line, to have removed.
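
One way to cross-check the sequence above is to extract the job graph events from each JobManager's log and compare them per job ID. The following is only a sketch: the regular expression is written against the three ZooKeeperSubmittedJobGraphStore messages quoted in this thread, and the function name job_graph_events is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the three ZooKeeperSubmittedJobGraphStore messages quoted in this thread.
EVENT_RE = re.compile(
    r"ZooKeeperSubmittedJobGraphStore\s+-\s+"
    r"(?:(Added|Recovered) SubmittedJobGraph\((\w+), \w+\)"
    r"|(Removed) job graph (\w+) from ZooKeeper)"
)


def job_graph_events(lines):
    """Return {job_id: [event, ...]} in the order the events appear in the log."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            kind = match.group(1) or match.group(3)
            job_id = match.group(2) or match.group(4)
            events[job_id].append(kind)
    return dict(events)
```

Run against the logs above, Jobmanager-1 should show "Added" followed by "Removed" for job d69b67e4d28a2d244b06d3f6d661bca1, while Jobmanager-2 shows a "Recovered" event for the same ID even though it was never the leader.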

On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?

Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick and there is nothing wrong with it.

Cheers,
Till

On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <[hidden email]> wrote:
Hello,

Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in a YARN cluster but does not specify such an option for Kubernetes. Is there anything that can go wrong with this approach?

Thanks

On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <[hidden email]> wrote:
Hi,

Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. 

Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.

Thanks,
Encho

On Wed, Aug 29, 2018 at 5:15 AM vino yang <[hidden email]> wrote:
Hi Encho,

From your description, it seems there may be additional bugs.

About your description:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

Does it happen every time?

In the Standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <[hidden email]> wrote:
Hello Till,

I spent a few more hours testing and looking at the logs, and it seems like there's a more general problem here. While the two job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes brings JobManager 1 back up quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
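To make the mechanism concrete, here is a minimal in-memory sketch in Python. It does not use ZooKeeper or Flink code: `TinyZk` and `try_cleanup` are hypothetical names, and the model only assumes what is described above, namely that each job manager holds an ephemeral lock node under the job graph node, that a node with children cannot be deleted, and that ephemeral nodes disappear only when their owner's session ends.

```python
class NotEmptyError(Exception):
    """Raised when deleting a node that still has children."""


class TinyZk:
    """Hypothetical in-memory stand-in for the relevant ZooKeeper behaviour."""

    def __init__(self):
        # parent path -> {child name: id of the session that owns the child}
        self.children = {}

    def create_node(self, path):
        self.children.setdefault(path, {})

    def add_ephemeral_lock(self, path, session_id):
        # Each session places its own ephemeral child under the node.
        self.children[path][f"lock-{session_id}"] = session_id

    def release_lock(self, path, session_id):
        self.children[path].pop(f"lock-{session_id}", None)

    def expire_session(self, session_id):
        # When a session ends, every ephemeral node it owned is removed.
        for locks in self.children.values():
            for name in [n for n, s in locks.items() if s == session_id]:
                del locks[name]

    def delete_node(self, path):
        if self.children[path]:
            raise NotEmptyError(path)
        del self.children[path]


def try_cleanup(zk, path, session_id):
    """Release our own lock, then try to delete the node; True on success."""
    zk.release_lock(path, session_id)
    try:
        zk.delete_node(path)
        return True
    except NotEmptyError:
        return False


zk = TinyZk()
job = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
zk.create_node(job)
zk.add_ephemeral_lock(job, "jm-1")  # old leader, session still alive
zk.add_ephemeral_lock(job, "jm-2")  # new leader

print(try_cleanup(zk, job, "jm-2"))  # False: jm-1's lock is still present
zk.expire_session("jm-1")            # session timeout finally elapses
print(try_cleanup(zk, job, "jm-2"))  # True
```

In this toy model the first cleanup attempt fails for the same reason as the "Directory not empty" error in the ZooKeeper log, and it only succeeds once the other session is gone.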

Cheers,
Till

On Tue, Aug 28, 2018 at 8:17 AM vino yang <[hidden email]> wrote:
Hi Encho,

As a temporary workaround, you can determine whether a job has been cleaned up by monitoring its JobID under ZooKeeper's "/jobgraph" path.
Another option is to modify the source code and forcibly switch the cleanup to synchronous mode. However, Flink has to obtain the corresponding lock before it operates on the ZooKeeper path, so this is dangerous and not recommended.
I think this problem may be solved in the next version. It depends on Till.
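A minimal sketch of the monitoring idea in Python, under stated assumptions: `leftover_job_graphs` is a hypothetical helper, the default root `/flink/default/jobgraphs` is taken from the ZooKeeper logs quoted in this thread, and `client` is any object with a `get_children(path)` method (for example a connected `KazooClient` from the kazoo library). `FakeClient` exists only so the example runs without a ZooKeeper server.

```python
def leftover_job_graphs(client, finished_job_ids, root="/flink/default/jobgraphs"):
    """Return the finished job IDs that still have a node under `root`."""
    present = set(client.get_children(root))
    return sorted(present & set(finished_job_ids))


class FakeClient:
    """Hypothetical test double standing in for a real ZooKeeper client."""

    def __init__(self, tree):
        self.tree = tree  # path -> list of child names

    def get_children(self, path):
        return list(self.tree.get(path, []))


client = FakeClient({
    "/flink/default/jobgraphs": [
        "83bfa359ca59ce1d4635e18e16651e15",  # finished, but node still there
        "9e0a109b57511930c95d3b54574a66e3",  # still running
    ]
})
finished = ["83bfa359ca59ce1d4635e18e16651e15"]
print(leftover_job_graphs(client, finished))
```

Any ID the function returns belongs to a job that has finished but whose job graph node was not removed.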

Thanks, vino.

On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <[hidden email]> wrote:
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

On Tue, Aug 28, 2018 at 4:52 AM vino yang <[hidden email]> wrote:
Some background on the implementation:
Flink uses ZooKeeper to store the JobGraph (the job's description and metadata) as the basis for job recovery.
However, earlier implementations may fail to clean up this information properly because it is deleted asynchronously by a background thread.

Thanks, vino.

On Tue, Aug 28, 2018 at 9:49 AM vino yang <[hidden email]> wrote:
Hi Encho,

This problem is already known to the Flink community. You can track its progress through FLINK-10011 [1]; Till is currently working on a fix.


Thanks, vino.

On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <[hidden email]> wrote:
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left from jobmanager-1. This means that both in HDFS and in Zookeeper remain job graphs, which later on obstruct any work of both managers as after any reset they unsuccessfully try to restore a non-existent job and fail over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
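When there are many such lines, a small Python sketch like the following can pull out the operation, path and error code. The regular expression is an assumption based only on the three log lines quoted above, and `parse_keeper_error` is a hypothetical helper, not part of ZooKeeper or Flink.

```python
import re

# Matches the "Got user-level KeeperException" lines shown above.
KEEPER_ERROR = re.compile(
    r"sessionid:(?P<session>0x[0-9a-f]+) type:(?P<op>\w+) .*? "
    r"Error Path:(?P<path>\S+) Error:KeeperErrorCode = (?P<error>.+?) for "
)


def parse_keeper_error(line):
    """Return session, op, path and error as a dict, or None if no match."""
    match = KEEPER_ERROR.search(line)
    return match.groupdict() if match else None


sample = (
    "2018-08-27 13:12:13,406 [myid:] - INFO  "
    "[ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - "
    "Got user-level KeeperException when processing "
    "sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd "
    "txntype:-1 reqpath:n/a "
    "Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 "
    "Error:KeeperErrorCode = Directory not empty for "
    "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
)
print(parse_keeper_error(sample))
```

Grouping the parsed records by `path` makes it easier to see which job graph nodes hit `NodeExists` on create and `Directory not empty` on delete.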

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs seem fine in jobmanager-2, it doesn’t seem to be aware that it’s overwriting anything or not deleting properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Hi Till,

Those are actually the full logs, except for the two parts I shortened (pipeline construction and execution). As I said, accessing the UI of Jobmanager 2 redirects to Jobmanager 1, so it seems to be aware that it is not the leader. Jobmanager 2 has no logs other than what I sent. Here is the full end-to-end log of Jobmanager 2 after repeating the experiment:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sediVa6XS: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-zwdqv.
2018-08-29 13:19:24,047 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 13:19:24,049 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 13:19:24,049 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 13:19:24,367 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 13:19:24,435 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 13:19:24,435 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 13:19:24,436 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 13:19:24,461 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 13:19:24,461 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 13:19:24,472 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 13:19:24,506 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 13:19:24,522 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 13:19:24,532 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 13:19:24,996 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 13:19:25,050 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 13:19:25,209 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 13:19:25,216 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 13:19:25,648 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 13:19:25,702 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 13:19:25,703 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 13:19:25,750 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 13:19:25,756 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 13:19:25,756 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-zwdqv
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 13:19:25,758 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 13:19:25,775 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5000339768628554676.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 13:19:25,776 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 13:19:25,777 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 13:19:25,777 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-40cefeee-e8d1-4522-aea3-957d9f7fbeee
2018-08-29 13:19:25,777 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 13:19:25,778 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 13:19:25,788 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690009, negotiated timeout = 40000
2018-08-29 13:19:25,789 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 13:19:25,793 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 13:19:25,798 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-76cce4e7-84ea-4624-a847-bbd7fdc4f109, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 13:19:25,824 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-7c6c2db0-f7ab-4cb6-909d-6c9cbfd78215
2018-08-29 13:19:25,838 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 13:19:25,839 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 13:19:25,840 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 13:19:25,843 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 13:19:26,143 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 13:19:26,143 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 13:19:26,216 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 13:19:26,217 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 13:19:26,236 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 13:19:26,248 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 13:19:26,323 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 13:19:26,335 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 13:19:26,336 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 13:19:26,338 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 13:19:26,339 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 13:23:21,513 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(836b29f7c66bbeb6ed8bae41cb9b316c, null).
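Regarding the session timeout question: the log above shows the client requesting `sessionTimeout=60000` and the server reporting `negotiated timeout = 40000`. A small Python sketch for extracting both values from such a log follows. The helper names `session_timeouts` and `within_session_timeout` are hypothetical, and the patterns assume the exact wording of the two log lines above.

```python
import re


def session_timeouts(log_lines):
    """Return (requested_ms, negotiated_ms); either is None if not found."""
    requested = negotiated = None
    for line in log_lines:
        match = re.search(r"sessionTimeout=(\d+)", line)
        if match:
            requested = int(match.group(1))
        match = re.search(r"negotiated timeout = (\d+)", line)
        if match:
            negotiated = int(match.group(1))
    return requested, negotiated


def within_session_timeout(elapsed_ms, negotiated_ms):
    """True if an action happened before the old session could have expired."""
    return elapsed_ms < negotiated_ms


lines = [
    "Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=...",
    "Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, "
    "sessionid = 0x26584fd55690009, negotiated timeout = 40000",
]
requested, negotiated = session_timeouts(lines)
print(requested, negotiated)
```

With the negotiated value, `within_session_timeout` can be used to check whether a job was stopped before ZooKeeper could have expired the killed job manager's session.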

Thanks,
Encho

On Wed, Aug 29, 2018 at 3:59 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks for sending the first part of the logs. What I would actually be interested in are the complete logs, because somewhere in the jobmanager-2 logs there must be a log statement saying that the respective dispatcher gained leadership. I would like to see why this happens, but the complete logs are necessary to debug it. It would be awesome if you could send them to me. Thanks a lot!

Cheers,
Till

On Wed, Aug 29, 2018 at 2:00 PM Encho Mishinev <[hidden email]> wrote:
Hi Till,

I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment, hope they help in debugging:

- Starting Jobmanager-1:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:49,134 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:49,215 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-1
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:49,250 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:49,282 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:49,298 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:49,309 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-1:50010
2018-08-29 11:41:49,768 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:49,823 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:49,974 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-1:50010]
2018-08-29 11:41:49,981 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-1:50010
2018-08-29 11:41:50,444 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:50,568 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:50,605 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a
2018-08-29 11:41:50,605 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:50,607 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:50,609 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:50,618 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = 40000
2018-08-29 11:41:50,619 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:50,627 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184
2018-08-29 11:41:50,674 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:50,675 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:50,676 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:50,679 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-1:8081
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:51,091 WARN  org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths  - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
2018-08-29 11:41:51,101 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-1:8081.
2018-08-29 11:41:51,114 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:51,141 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://flink-jobmanager-1:8081 was granted leadership with leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd
2018-08-29 11:41:51,214 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:51,230 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:51,232 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:51,234 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:51,235 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 11:41:51,253 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,254 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-29 11:41:51,468 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.

Starting Jobmanager-2:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9.
2018-08-29 11:41:51,688 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:52,018 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:52,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,122 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:52,123 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:52,133 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:52,173 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:52,188 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:52,198 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 11:41:52,753 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:52,822 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:53,038 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 11:41:53,046 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 11:41:53,500 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:53,558 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:53,559 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:53,616 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:53,624 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:53,644 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:53,646 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:53,646 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:53,647 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:53,649 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = 40000
2018-08-29 11:41:53,656 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:53,667 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

Upon submitting a batch job on Jobmanager-1, we immediately get this log on Jobmanager-2:
2018-08-29 11:47:06,249 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

Meanwhile Jobmanager-1 gets:
-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)

2018-08-29 11:47:06,006 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting job d69b67e4d28a2d244b06d3f6d661bca1 (sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.

-loads of job execution info-

2018-08-29 11:49:20,272 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.


-------------------

The result is:
HDFS contains only a jobgraph and an empty default folder; everything else has been cleared.
ZooKeeper still contains the jobgraph that Jobmanager-1 claims to have removed in its last log line.
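
A minimal sketch of how the two JobManager logs above could be cross-checked automatically. It only assumes the three ZooKeeperSubmittedJobGraphStore message formats quoted in this thread; the function names are hypothetical helpers and not part of Flink.

```python
import re
from collections import defaultdict

# Message formats taken from the ZooKeeperSubmittedJobGraphStore lines in this thread.
EVENT_PATTERNS = {
    "added": re.compile(r"Added SubmittedJobGraph\((\w+),"),
    "recovered": re.compile(r"Recovered SubmittedJobGraph\((\w+),"),
    "removed": re.compile(r"Removed job graph (\w+) from ZooKeeper"),
}


def job_graph_events(log_lines):
    """Map each job ID to the ordered list of job graph events found in one log."""
    events = defaultdict(list)
    for line in log_lines:
        for name, pattern in EVENT_PATTERNS.items():
            match = pattern.search(line)
            if match:
                events[match.group(1)].append(name)
    return dict(events)


def recovered_by_standby(active_log, standby_log):
    """Job IDs that the active JobManager added and the standby also recovered."""
    active = job_graph_events(active_log)
    standby = job_graph_events(standby_log)
    return sorted(
        job_id
        for job_id, seen in active.items()
        if "added" in seen and "recovered" in standby.get(job_id, [])
    )
```

Feeding it the Jobmanager-1 and Jobmanager-2 lines quoted above would flag job d69b67e4d28a2d244b06d3f6d661bca1 as added on one manager and recovered on the other, which is the pattern reported here.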

On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?

Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick and there is nothing wrong with it.

Cheers,
Till

On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <[hidden email]> wrote:
Hello,

Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring it is restarted whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in a YARN cluster but does not mention such an option for Kubernetes. Is there anything that can go wrong with this approach?

Thanks

On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <[hidden email]> wrote:
Hi,

Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. 

Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.

Thanks,
Encho

On Wed, Aug 29, 2018 at 5:15 AM vino yang <[hidden email]> wrote:
Hi Encho,

From your description, it sounds as if there may be an additional bug.

About your description:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

Does this happen every time?

In a standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <[hidden email]> wrote:
Hello Till,

I spent a few more hours testing and looking at the logs, and it seems like there's a more general problem here. While both job managers are active, neither of them can properly delete jobgraphs. The problem I described earlier comes from the fact that Kubernetes brings JobManager 1 back up quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
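
A small sketch of the timing check suggested above: given when the old JobManager was killed and when the job was stopped, it reports whether the stop fell inside the ZooKeeper session timeout. The helper name is hypothetical and the timestamps in the usage note are made up for illustration; the 40000 ms value is the negotiated timeout printed in the JobManager log quoted elsewhere in this thread.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the Flink and ZooKeeper log lines in this thread.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def stopped_within_session_timeout(killed_at, stopped_at, session_timeout_ms):
    """True if the job was stopped before the old session could have expired.

    In that window ZooKeeper may not yet have removed the ephemeral nodes
    of the killed JobManager.
    """
    killed = datetime.strptime(killed_at, LOG_TIME_FORMAT)
    stopped = datetime.strptime(stopped_at, LOG_TIME_FORMAT)
    return stopped - killed < timedelta(milliseconds=session_timeout_ms)
```

For example, `stopped_within_session_timeout("2018-08-27 13:11:00,000", "2018-08-27 13:11:30,000", 40000)` is true, because 30 seconds is shorter than the 40 second timeout.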

Cheers,
Till

On Tue, Aug 28, 2018 at 8:17 AM vino yang <[hidden email]> wrote:
Hi Encho,

As a temporary workaround, you can check whether a job has been cleaned up by monitoring its JobID under ZooKeeper's "/jobgraphs" path.
Another option is to modify the source code and crudely switch the cleanup to a synchronous form. However, Flink has to obtain the corresponding lock before it operates on a ZooKeeper path, so this is dangerous and not recommended.
I think this problem may be solved in the next version; that depends on Till.
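
A minimal sketch of the monitoring workaround, assuming the namespace "/flink/default/jobgraphs" that appears in the ZooKeeper logs of this thread. The function is a hypothetical helper: it accepts any client object offering `exists(path)` and `get_children(path)`, which is the shape of the kazoo library's `KazooClient`; using kazoo is an assumption, not something the thread prescribes.

```python
def job_graph_status(zk, job_id, root="/flink/default/jobgraphs"):
    """Report whether a job graph node is still present and list its children.

    `zk` is any client with exists(path) and get_children(path) methods.
    Leftover children indicate that something is still registered under the node.
    """
    path = f"{root}/{job_id}"
    if not zk.exists(path):
        return {"path": path, "present": False, "children": []}
    return {"path": path, "present": True, "children": sorted(zk.get_children(path))}
```

With kazoo installed, one would create `KazooClient(hosts="zk-cs:2181")`, call `start()`, pass the client to `job_graph_status`, and call `stop()` afterwards.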

Thanks, vino.

On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <[hidden email]> wrote:
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

On Tue, Aug 28, 2018 at 4:52 AM vino yang <[hidden email]> wrote:
Some background on the implementation:
Flink stores the JobGraph (the job's description and metadata) in ZooKeeper as the basis for job recovery.
However, previous implementations may fail to clean up this information properly because it is deleted asynchronously by a background thread.

Thanks, vino.

On Tue, Aug 28, 2018 at 9:49 AM vino yang <[hidden email]> wrote:
Hi Encho,

This problem is already known to the Flink community. You can track its progress in FLINK-10011 [1]; Till is currently working on a fix.


Thanks, vino.

On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <[hidden email]> wrote:
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left from jobmanager-1. This means that job graphs remain in both HDFS and Zookeeper. They later obstruct the work of both managers, because after any reset the managers try to restore a non-existent job and fail over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs seem fine in jobmanager-2, it doesn’t seem to be aware that it’s overwriting anything or not deleting properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

Till Rohrmann
Hi Encho,

thanks for sending me the logs. I think I found a bug which could explain what you are observing: We listen to newly added jobs and try to recover them independently of the leadership status. Because of this, a standby JobManager also tries to recover a submitted job, although it won't execute it. Unfortunately, recovering a job already locks it, and the lock is not released if the job cannot be executed. I've documented the problem here [1]. This is quite a nasty bug which we should fix asap. Thanks a lot for reporting the problem!
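
To illustrate the locking behaviour described above, here is a toy in-memory model. It is a sketch under stated assumptions and not Flink's implementation: all class and method names are hypothetical, locks are modelled as ephemeral children of the job graph node, and a blocked delete raises an error, whereas the Flink 1.5.3 logs in this thread print "Removed job graph" even though the node remains.

```python
class NotEmptyError(Exception):
    """Deleting a node that still has children (cf. 'Directory not empty')."""


class ToyZooKeeper:
    """In-memory stand-in for the few ZooKeeper behaviours relevant here."""

    def __init__(self):
        # node path -> {lock child name: session that owns it}
        self.nodes = {}

    def create(self, path):
        self.nodes.setdefault(path, {})

    def create_ephemeral_child(self, path, session):
        self.nodes[path][f"lock-{session}"] = session

    def delete_child(self, path, session):
        self.nodes[path].pop(f"lock-{session}", None)

    def delete(self, path):
        if self.nodes[path]:
            raise NotEmptyError(path)
        del self.nodes[path]

    def close_session(self, session):
        # Ephemeral children disappear when the owning session ends.
        for locks in self.nodes.values():
            for name in [n for n, owner in locks.items() if owner == session]:
                del locks[name]


class ToyJobGraphStore:
    """Hypothetical, simplified job graph store; one instance per JobManager."""

    def __init__(self, zk, session):
        self.zk, self.session = zk, session

    def _path(self, job_id):
        return f"/jobgraphs/{job_id}"

    def add(self, job_id):
        self.zk.create(self._path(job_id))
        self.zk.create_ephemeral_child(self._path(job_id), self.session)

    def recover(self, job_id):
        # Recovering takes a lock, even on a standby that never runs the job.
        self.zk.create_ephemeral_child(self._path(job_id), self.session)

    def remove(self, job_id):
        # Release our own lock, then try to delete the node itself.
        self.zk.delete_child(self._path(job_id), self.session)
        self.zk.delete(self._path(job_id))


def simulate():
    zk = ToyZooKeeper()
    active, standby = ToyJobGraphStore(zk, "jm-1"), ToyJobGraphStore(zk, "jm-2")
    active.add("job-a")
    standby.recover("job-a")  # the standby reacts to the newly added job
    try:
        active.remove("job-a")
        first_attempt = "removed"
    except NotEmptyError:
        first_attempt = "blocked"  # the standby's lock is still in place
    zk.close_session("jm-2")  # e.g. the standby is shut down
    active.remove("job-a")
    return first_attempt, "/jobgraphs/job-a" in zk.nodes
```

In this model the first removal is blocked by the standby's leftover lock and only succeeds once the standby's session ends, which mirrors the observation that cleanup works when a single JobManager is running.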


Cheers,
Till

On Wed, Aug 29, 2018 at 3:31 PM Encho Mishinev <[hidden email]> wrote:
Hi Till,

Those are actually the full logs except for the two parts I shortened (pipeline construction and execution). As I said, accessing the UI for Jobmanager 2 redirects to Jobmanager 1, so it seems to be aware that it is not the leader. Jobmanager 2 has no other logs than what I sent. Here is the full end-to-end log of Jobmanager 2 after repeating the experiment:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sediVa6XS: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-zwdqv.
2018-08-29 13:19:24,047 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 13:19:24,049 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 13:19:24,049 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 13:19:24,367 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 13:19:24,435 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 13:19:24,435 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 13:19:24,436 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 13:19:24,461 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 13:19:24,461 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 13:19:24,472 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 13:19:24,506 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 13:19:24,522 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 13:19:24,532 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 13:19:24,996 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 13:19:25,050 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 13:19:25,209 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 13:19:25,216 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 13:19:25,648 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 13:19:25,702 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 13:19:25,703 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 13:19:25,750 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 13:19:25,756 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 13:19:25,756 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-zwdqv
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 13:19:25,758 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 13:19:25,775 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5000339768628554676.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 13:19:25,776 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 13:19:25,777 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 13:19:25,777 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-40cefeee-e8d1-4522-aea3-957d9f7fbeee
2018-08-29 13:19:25,777 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 13:19:25,778 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 13:19:25,788 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690009, negotiated timeout = 40000
2018-08-29 13:19:25,789 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 13:19:25,793 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 13:19:25,798 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-76cce4e7-84ea-4624-a847-bbd7fdc4f109, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 13:19:25,824 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-7c6c2db0-f7ab-4cb6-909d-6c9cbfd78215
2018-08-29 13:19:25,838 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 13:19:25,839 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 13:19:25,840 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 13:19:25,843 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 13:19:26,143 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 13:19:26,143 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 13:19:26,216 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 13:19:26,217 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 13:19:26,236 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 13:19:26,248 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 13:19:26,323 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 13:19:26,335 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 13:19:26,336 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 13:19:26,338 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 13:19:26,339 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 13:23:21,513 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(836b29f7c66bbeb6ed8bae41cb9b316c, null).

Thanks,
Encho

On Wed, Aug 29, 2018 at 3:59 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

Thanks for sending the first part of the logs. What I would actually be interested in are the complete logs, because somewhere in the jobmanager-2 logs there must be a log statement saying that the respective dispatcher gained leadership. I would like to see why this happens, but debugging it requires the complete logs. It would be awesome if you could send them to me. Thanks a lot!
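
For anyone following along, here is a minimal sketch of one way to pull the leadership statements out of a complete JobManager log. It assumes the line layout seen in the logs in this thread (timestamp, level, logger, " - ", message) and the phrase "was granted leadership" that appears in them. The function name `leadership_events` and the sample lines are only for illustration and are not part of Flink.

```python
import re

# Assumed layout, taken from the log lines quoted in this thread:
# "<date> <time>,<ms> <LEVEL>  <logger>   - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+- (?P<msg>.*)$"
)


def leadership_events(lines):
    """Yield (timestamp, short logger name, message) for leadership grants.

    Lines that do not follow the assumed layout (for example the startup
    script output before the first timestamped line) are skipped.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and "was granted leadership" in match.group("msg"):
            short_name = match.group("logger").rsplit(".", 1)[-1]
            yield match.group("ts"), short_name, match.group("msg")


# Two lines copied from the jobmanager-1 log in this thread.
sample = [
    "2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d",
    "2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.",
]

for event in leadership_events(sample):
    print(event)
```

Running the same function over the full jobmanager-2 log (for example by passing an open file object as `lines`) would list when, if ever, its dispatcher, resource manager and REST endpoint were granted leadership.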

Cheers,
Till

On Wed, Aug 29, 2018 at 2:00 PM Encho Mishinev <[hidden email]> wrote:
Hi Till,

I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment. I hope they help with debugging:

- Starting Jobmanager-1:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:49,134 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:49,215 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-1
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:49,250 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:49,282 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:49,298 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:49,309 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-1:50010
2018-08-29 11:41:49,768 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:49,823 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:49,974 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-1:50010]
2018-08-29 11:41:49,981 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-1:50010
2018-08-29 11:41:50,444 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:50,568 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:50,605 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a
2018-08-29 11:41:50,605 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:50,607 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:50,609 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:50,618 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = 40000
2018-08-29 11:41:50,619 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:50,627 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184
2018-08-29 11:41:50,674 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:50,675 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:50,676 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:50,679 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-1:8081
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:51,091 WARN  org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths  - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
2018-08-29 11:41:51,101 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-1:8081.
2018-08-29 11:41:51,114 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:51,141 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://flink-jobmanager-1:8081 was granted leadership with leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd
2018-08-29 11:41:51,214 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:51,230 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:51,232 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:51,234 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:51,235 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 11:41:51,253 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,254 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-29 11:41:51,468 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.

- Starting Jobmanager-2:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9.
2018-08-29 11:41:51,688 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:52,018 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:52,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,122 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:52,123 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:52,133 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:52,173 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:52,188 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:52,198 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 11:41:52,753 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:52,822 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:53,038 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 11:41:53,046 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 11:41:53,500 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:53,558 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:53,559 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:53,616 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:53,624 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:53,644 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:53,646 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:53,646 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:53,647 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:53,649 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = 40000
2018-08-29 11:41:53,656 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:53,667 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

Upon submitting a batch job on Jobmanager-1, we immediately get this log on Jobmanager-2
2018-08-29 11:47:06,249 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

Meanwhile Jobmanager-1 gets:
-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)

2018-08-29 11:47:06,006 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting job d69b67e4d28a2d244b06d3f6d661bca1 (sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.

-loads of job execution info-

2018-08-29 11:49:20,272 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.


-------------------

The result is:
HDFS has only a jobgraph and an empty default folder - everything else is cleared
ZooKeeper still has the jobgraph that Jobmanager-1 claims, in its last log line above, to have removed.

On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?

Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick and there is nothing wrong with it.

Cheers,
Till

On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <[hidden email]> wrote:
Hello,

Since two job managers don't seem to be working for me I was thinking of just using a single job manager in Kubernetes in HA mode with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in a YARN cluster but does not specify such an option for Kubernetes. Is there anything that can go wrong with this approach?

Thanks

On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <[hidden email]> wrote:
Hi,

Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. 

Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.

Thanks,
Encho

On Wed, Aug 29, 2018 at 5:15 AM vino yang <[hidden email]> wrote:
Hi Encho,

From your description, I feel that there are extra bugs.

About your description:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

Does it happen every time?

In the Standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <[hidden email]> wrote:
Hello Till,

I spent a few more hours testing and looking at the logs and it seems like there's a more general problem here. While the two job managers are active neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes restarts JobManager 1 quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
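
As a rough illustration of the mechanism described above, here is a minimal in-memory sketch. It is a toy model, not the ZooKeeper client API or Flink's code: the class `ToyZooKeeper`, the helper `try_cleanup` and the lock node name `lock-old-jm` are all made up for this example. It only shows that a parent node cannot be deleted while an ephemeral child of a still-live session exists, and that the delete succeeds once that session is gone.

```python
class NotEmptyError(Exception):
    """Raised when deleting a node that still has children."""


class ToyZooKeeper:
    """In-memory stand-in for a ZooKeeper tree; not the real client API."""

    def __init__(self):
        self.nodes = {}  # path -> owning session id (None for persistent nodes)

    def create(self, path, session=None):
        # Ephemeral nodes remember the session that created them.
        self.nodes[path] = session

    def delete(self, path):
        # Like ZooKeeper, refuse to delete a node that still has children.
        if any(p.startswith(path + "/") for p in self.nodes):
            raise NotEmptyError(path)
        del self.nodes[path]

    def expire_session(self, session):
        # When a session closes or times out, its ephemeral nodes vanish.
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}


def try_cleanup(zk, job_path):
    """Return True if the job graph node could be deleted."""
    try:
        zk.delete(job_path)
        return True
    except NotEmptyError:
        return False


zk = ToyZooKeeper()
job = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
zk.create(job)                                  # persistent job graph node
zk.create(job + "/lock-old-jm", session="old")  # hypothetical ephemeral lock of the old leader

blocked = try_cleanup(zk, job)   # old session still alive, so the delete is refused
zk.expire_session("old")         # old JM disconnects or its session times out
cleaned = try_cleanup(zk, job)   # now the new leader can delete the node
print(blocked, cleaned)
```

In this model the first cleanup attempt corresponds to stopping the job while the old JobManager's session is still open (or within the session timeout), and the second to stopping it after the session has expired.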

Cheers,
Till

On Tue, Aug 28, 2018 at 8:17 AM vino yang <[hidden email]> wrote:
Hi Encho,

As a temporary workaround, you can determine whether a job has been cleaned up by monitoring its JobID under Zookeeper's "/jobgraphs" path.
Another option is to modify the source code and forcibly change the cleanup to a synchronous form. However, Flink needs to obtain the corresponding lock before it operates on the Zookeeper path, so doing this is dangerous and not recommended.
I think this problem may be solved in the next version. It depends on Till.
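
The monitoring workaround above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `leftover_job_graphs` and `is_cleaned_up` are invented here, the root path is the namespace that appears in the logs of this thread, and the `zk` argument is any object that offers `exists(path)` and `get_children(path)`, for example a started `KazooClient` from the kazoo library (nobody in the thread mentions using kazoo).

```python
def leftover_job_graphs(zk, root="/flink/default/jobgraphs"):
    """List the job IDs that still have a node under `root`.

    `zk` is any client object with exists(path) and get_children(path).
    """
    if not zk.exists(root):
        return []
    return sorted(zk.get_children(root))


def is_cleaned_up(zk, job_id, root="/flink/default/jobgraphs"):
    """Return True once the node for `job_id` is gone."""
    return not zk.exists(f"{root}/{job_id}")


# Possible usage with kazoo (needs a reachable ZooKeeper; the address
# is the quorum from the configuration posted in this thread):
#
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk-cs:2181")
#   zk.start()
#   print(leftover_job_graphs(zk))
#   zk.stop()
```

Polling `is_cleaned_up` after a job reaches a terminal state would show whether the job graph node was really removed, independent of what the JobManager log claims.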

Thanks, vino.

On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <[hidden email]> wrote:
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

On Tue, Aug 28, 2018 at 4:52 AM vino yang <[hidden email]> wrote:
Some background on the implementation:
Flink uses Zookeeper to store the JobGraph (the job's description and metadata) as the basis for job recovery.
However, previous implementations may fail to clean this information up properly because it is deleted asynchronously by a background thread.

Thanks, vino.

On Tue, Aug 28, 2018 at 9:49 AM vino yang <[hidden email]> wrote:
Hi Encho,

This is a problem already known to the Flink community. You can track its progress through FLINK-10011 [1]; Till is currently fixing this issue.


Thanks, vino.

On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <[hidden email]> wrote:
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left from jobmanager-1. As a result, job graphs remain in both HDFS and Zookeeper and later obstruct the work of both managers: after any reset they try to restore a non-existent job and fail over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs in jobmanager-2 seem fine; it doesn’t seem to be aware that it’s overwriting anything or not deleting properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

Encho Mishinev
Hi Till,

That's great that you've traced the problem. It seems like a lot of people have been reporting similar problems. Thanks for reacting so quickly and good luck with fixing the bug. I will use a single JobManager with K8S Deployment for now, but look forward to the fix.

Thanks,
Encho

On Wed, Aug 29, 2018 at 4:43 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks for sending me the logs. I think I found a bug which could explain what you are observing: We listen to newly added jobs and try to recover them regardless of the leadership status. Because of this, a standby JobManager also tries to recover a submitted job, although it won't execute it. Unfortunately, recovering a job already locks it, and the lock is not released if the job cannot be executed. I've documented the problem here [1]. This is quite a mean bug which we should fix asap. Thanks a lot for reporting the problem!
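
The behaviour described above can be illustrated with a small toy model. This is not Flink's implementation: `ToyJobGraphStore` and its methods are hypothetical names chosen for the sketch, and the final `release` call is only an assumed way of showing that the leftover lock is what blocks removal, not the actual fix.

```python
class ToyJobGraphStore:
    """Toy model of a job graph store with per-JobManager locks."""

    def __init__(self):
        self.locks = {}      # job id -> set of JobManager names holding a lock
        self.listeners = []  # callbacks invoked whenever a job graph is added

    def add(self, job_id, owner):
        self.locks[job_id] = {owner}
        for notify in self.listeners:
            notify(job_id)

    def recover(self, job_id, owner):
        # Recovering a job graph also takes a lock on it.
        self.locks[job_id].add(owner)

    def release(self, job_id, owner):
        self.locks.get(job_id, set()).discard(owner)

    def remove(self, job_id, owner):
        # Drop our own lock; the entry only goes away if nobody else holds one.
        self.release(job_id, owner)
        if self.locks.get(job_id):
            return False
        self.locks.pop(job_id, None)
        return True


store = ToyJobGraphStore()
# The standby reacts to every added job, regardless of leadership.
store.listeners.append(lambda job_id: store.recover(job_id, "jobmanager-2"))

job = "d69b67e4d28a2d244b06d3f6d661bca1"
store.add(job, "jobmanager-1")               # the leader submits the job
removed = store.remove(job, "jobmanager-1")  # job finishes; the leader cleans up
print(removed, store.locks[job])             # the standby's lock is still there

store.release(job, "jobmanager-2")           # hypothetical: standby gives up its lock
removed_after_release = store.remove(job, "jobmanager-1")
print(removed_after_release)
```

In the toy model the first removal fails only because of the lock taken by the standby, which matches the observation that job graphs are deleted fine when a single JobManager runs in HA mode.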


Cheers,
Till

On Wed, Aug 29, 2018 at 3:31 PM Encho Mishinev <[hidden email]> wrote:
Hi Till,

Those are actually the full logs except the two parts I shortened (pipeline construction and execution). As I said, accessing the UI for Jobmanager 2 redirects to Jobmanager 1, so it seems to be aware that it is not the leader. Jobmanager 2 has no other logs than what I sent. Here is the full end-to-end log of Jobmanager 2 after repeating the experiment:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sediVa6XS: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-zwdqv.
2018-08-29 13:19:24,047 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 13:19:24,049 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 13:19:24,049 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 13:19:24,367 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 13:19:24,431 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 13:19:24,434 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 13:19:24,435 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 13:19:24,435 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 13:19:24,436 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 13:19:24,442 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 13:19:24,443 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 13:19:24,444 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 13:19:24,445 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 13:19:24,461 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 13:19:24,461 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 13:19:24,472 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 13:19:24,506 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 13:19:24,522 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 13:19:24,532 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 13:19:24,996 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 13:19:25,050 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 13:19:25,209 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 13:19:25,216 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 13:19:25,648 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 13:19:25,702 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 13:19:25,703 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 13:19:25,750 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 13:19:25,756 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 13:19:25,756 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-zwdqv
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 13:19:25,757 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 13:19:25,758 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 13:19:25,775 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5000339768628554676.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 13:19:25,776 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 13:19:25,777 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 13:19:25,777 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-40cefeee-e8d1-4522-aea3-957d9f7fbeee
2018-08-29 13:19:25,777 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 13:19:25,778 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 13:19:25,788 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690009, negotiated timeout = 40000
2018-08-29 13:19:25,789 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 13:19:25,793 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 13:19:25,798 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-76cce4e7-84ea-4624-a847-bbd7fdc4f109, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 13:19:25,824 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-7c6c2db0-f7ab-4cb6-909d-6c9cbfd78215
2018-08-29 13:19:25,838 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 13:19:25,839 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 13:19:25,840 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 13:19:25,843 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 13:19:26,143 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 13:19:26,143 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 13:19:26,216 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 13:19:26,217 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 13:19:26,236 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 13:19:26,248 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 13:19:26,323 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 13:19:26,335 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 13:19:26,336 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 13:19:26,338 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 13:19:26,339 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 13:23:21,513 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(836b29f7c66bbeb6ed8bae41cb9b316c, null).

Thanks,
Encho

On Wed, Aug 29, 2018 at 3:59 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks for sending the first part of the logs. What I am actually interested in are the complete logs, because somewhere in the jobmanager-2 logs there must be a log statement saying that the respective dispatcher gained leadership. I would like to see why this happens, but debugging it requires the complete logs. It would be awesome if you could send them to me. Thanks a lot!
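
For anyone who wants to locate those statements in a long log before sending it, here is a minimal sketch in Python. It assumes the log4j line layout seen in the logs quoted in this thread (timestamp, level, logger, " - ", message) and the "was granted leadership" wording that the Flink 1.5.3 logs here use. The function name find_leadership_events is made up for illustration and is not part of Flink.

```python
import re

# Matches lines such as:
# 2018-08-29 11:41:51,263 INFO  ...StandaloneDispatcher  - Dispatcher akka.tcp://... was granted leadership with fencing token ...
LEADERSHIP = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+"
    r"(?P<logger>\S+)\s+- (?P<msg>.*was granted leadership.*)$"
)


def find_leadership_events(lines):
    """Return (timestamp, short logger name, message) for each leadership grant."""
    events = []
    for line in lines:
        match = LEADERSHIP.match(line.rstrip("\n"))
        if match:
            # Keep only the class name, e.g. "StandaloneDispatcher".
            short_logger = match.group("logger").rsplit(".", 1)[-1]
            events.append((match.group("ts"), short_logger, match.group("msg")))
    return events
```

The function accepts any iterable of lines, so an open file handle on a saved jobmanager log can be passed in directly. Filtering the result for StandaloneDispatcher entries then shows when, and how often, each job manager's dispatcher became leader.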

Cheers,
Till

On Wed, Aug 29, 2018 at 2:00 PM Encho Mishinev <[hidden email]> wrote:
Hi Till,

I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment. I hope they help with debugging:

- Starting Jobmanager-1:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedR98XPn: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-1-f76fd4df8-ftwt9.
2018-08-29 11:41:48,806 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:48,807 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:49,134 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:49,210 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:49,213 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:49,214 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:49,215 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-1
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:49,221 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:49,222 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:49,223 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:49,224 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:49,225 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:49,239 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:49,250 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:49,282 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:49,298 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:49,309 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-1:50010
2018-08-29 11:41:49,768 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:49,823 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:49,974 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-1:50010]
2018-08-29 11:41:49,981 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-1:50010
2018-08-29 11:41:50,444 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:50,509 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:50,568 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-1-f76fd4df8-ftwt9
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:50,577 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:50,578 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:50,605 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-d408cea8-2ed0-461a-a30a-a62b70fd332a
2018-08-29 11:41:50,605 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-5372401662150571998.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:50,607 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:50,607 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:50,608 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:50,609 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:50,618 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690005, negotiated timeout = 40000
2018-08-29 11:41:50,619 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:50,627 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:50,633 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-c5df0b39-86f3-4fba-bdda-aacca4f86086, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:50,659 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-c12d55af-3c2d-4fc2-8ee8-6de642522184
2018-08-29 11:41:50,674 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:50,675 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:50,676 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:50,679 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:50,995 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-1:8081
2018-08-29 11:41:51,071 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:51,091 WARN  org.apache.flink.shaded.curator.org.apache.curator.utils.ZKPaths  - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
2018-08-29 11:41:51,101 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-1:8081.
2018-08-29 11:41:51,114 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:51,141 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://flink-jobmanager-1:8081 was granted leadership with leaderSessionID=bb0d4dfd-c2c4-480b-bc86-62e231a606dd
2018-08-29 11:41:51,214 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:51,230 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:51,232 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:51,234 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:51,235 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-08-29 11:41:51,253 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@flink-jobmanager-1:50010/user/resourcemanager was granted leadership with fencing token ba47ed8daa8ff16bea6fc355c13f4d49
2018-08-29 11:41:51,254 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@flink-jobmanager-1:50010/user/dispatcher was granted leadership with fencing token 703301bf-85e7-4464-990f-ad39128a7b4d
2018-08-29 11:41:51,263 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2018-08-29 11:41:51,468 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager c8a3201d58d87dbbe16f8eb352b5c5b6 under 1c5bf0bc3848bd384b6f032ff7213754 at the SlotManager.
2018-08-29 11:41:51,471 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Registering TaskManager 104d18b72fed054620e58e120a1ea083 under e9d3e8ad3b477dd2e58bcb88a2c0d061 at the SlotManager.
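
Side note on the config dump printed at the top of this log: taskmanager.numberOfTaskSlots, blob.server.port and query.server.port each appear twice, with the same value both times. A minimal sketch for spotting such repeats is given below. It assumes a flat "key: value" file as printed above. The function name duplicate_keys is hypothetical and not a Flink utility.

```python
from collections import defaultdict


def duplicate_keys(config_text):
    """Map each key that appears more than once to its values, in file order."""
    values = defaultdict(list)
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first ": " only, so values like hdfs://host:8020 stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            values[key].append(value.strip())
    return {key: vals for key, vals in values.items() if len(vals) > 1}
```

If a repeated key ever carried two different values, the returned lists would make the conflict visible at a glance.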

- Starting Jobmanager-2:

Starting Job Manager
sed: cannot rename /opt/flink/conf/sedH2ZiSu: Device or resource busy
config file:
jobmanager.rpc.address: flink-jobmanager-2
jobmanager.rpc.port: 6123
jobmanager.heap.size: 8192
taskmanager.heap.size: 8192
taskmanager.numberOfTaskSlots: 4
high-availability: zookeeper
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
high-availability.zookeeper.quorum: zk-cs:2181
high-availability.zookeeper.path.root: /flink
high-availability.jobmanager.port: 50010
state.backend: filesystem
state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
state.savepoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
state.backend.incremental: false
fs.default-scheme: hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
rest.port: 8081
web.upload.dir: /opt/flink/upload
query.server.port: 6125
taskmanager.numberOfTaskSlots: 4
classloader.parent-first-patterns.additional: org.apache.xerces.
blob.storage.directory: /opt/flink/blob-server
blob.server.port: 6124
blob.server.port: 6124
query.server.port: 6125
Starting standalonesession as a console application on host flink-jobmanager-2-7844b78c9-kmvw9.
2018-08-29 11:41:51,688 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Starting StandaloneSessionClusterEntrypoint (Version: 1.5.3, Rev:614f216, Date:16.08.2018 @ 06:39:50 GMT)
2018-08-29 11:41:51,690 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  OS current user: flink
2018-08-29 11:41:52,018 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Maximum heap size: 6702 MiBytes
2018-08-29 11:41:52,088 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JAVA_HOME: /docker-java-home/jre
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop version: 2.7.5
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  JVM Options:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Program Arguments:
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --configDir
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     /opt/flink/conf
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --executionMode
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     --host
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -     cluster
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Classpath: /opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:52,091 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2018-08-29 11:41:52,092 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, flink-jobmanager-2
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-08-29 11:41:52,103 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 8192
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability, zookeeper
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.storageDir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.quorum, zk-cs:2181
2018-08-29 11:41:52,104 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.zookeeper.path.root, /flink
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: high-availability.jobmanager.port, 50010
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend, filesystem
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.checkpoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/checkpoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.savepoints.dir, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/savepoints
2018-08-29 11:41:52,105 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: state.backend.incremental, false
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: fs.default-scheme, hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.upload.dir, /opt/flink/upload
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,106 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: classloader.parent-first-patterns.additional, org.apache.xerces.
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.storage.directory, /opt/flink/blob-server
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: blob.server.port, 6124
2018-08-29 11:41:52,107 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: query.server.port, 6125
2018-08-29 11:41:52,122 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Starting StandaloneSessionClusterEntrypoint.
2018-08-29 11:41:52,123 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install default filesystem.
2018-08-29 11:41:52,133 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Install security context.
2018-08-29 11:41:52,173 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to flink (auth:SIMPLE)
2018-08-29 11:41:52,188 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Initializing cluster services.
2018-08-29 11:41:52,198 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at flink-jobmanager-2:50010
2018-08-29 11:41:52,753 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-08-29 11:41:52,822 INFO  akka.remote.Remoting                                          - Starting remoting
2018-08-29 11:41:53,038 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@flink-jobmanager-2:50010]
2018-08-29 11:41:53,046 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink@flink-jobmanager-2:50010
2018-08-29 11:41:53,500 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating highly available BLOB storage directory at hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020/flink/high-availability//default/blob
2018-08-29 11:41:53,558 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Enforcing default ACL for ZK connections
2018-08-29 11:41:53,559 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.
2018-08-29 11:41:53,616 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Starting
2018-08-29 11:41:53,624 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:host.name=flink-jobmanager-2-7844b78c9-kmvw9
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.version=1.8.0_181
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.vendor=Oracle Corporation
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.class.path=/opt/flink/lib/flink-python_2.11-1.5.3.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.5.3.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.5.3.jar:::
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.io.tmpdir=/tmp
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:java.compiler=<NA>
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.name=Linux
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.arch=amd64
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:os.version=4.4.0-1027-gke
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.name=flink
2018-08-29 11:41:53,625 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.home=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Client environment:user.dir=/opt/flink
2018-08-29 11:41:53,626 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating client connection, connectString=zk-cs:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@17ae7628
2018-08-29 11:41:53,644 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-8238466329925822361.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2018-08-29 11:41:53,646 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening socket connection to server zk-cs.default.svc.cluster.local/10.27.248.104:2181
2018-08-29 11:41:53,646 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /opt/flink/blob-server/blobStore-61cdb645-5d0c-47fd-bcf6-84ad16fadade
2018-08-29 11:41:53,646 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
2018-08-29 11:41:53,647 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket connection established to zk-cs.default.svc.cluster.local/10.27.248.104:2181, initiating session
2018-08-29 11:41:53,649 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2018-08-29 11:41:53,655 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session establishment complete on server zk-cs.default.svc.cluster.local/10.27.248.104:2181, sessionid = 0x26584fd55690006, negotiated timeout = 40000
2018-08-29 11:41:53,656 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2018-08-29 11:41:53,667 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-08-29 11:41:53,673 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-8b236c14-79ee-4a84-b23f-437408c4661a, expiration time 3600000, maximum cache size 52428800 bytes.
2018-08-29 11:41:53,699 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /opt/flink/blob-server/blobStore-80c519df-cc6f-4e9c-9cd5-da4077c826f0
2018-08-29 11:41:53,717 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-08-29 11:41:53,718 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory /opt/flink/upload/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-08-29 11:41:53,719 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory /opt/flink/upload/flink-web-upload for file uploads.
2018-08-29 11:41:53,722 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2018-08-29 11:41:54,084 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at flink-jobmanager-2:8081
2018-08-29 11:41:54,160 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2018-08-29 11:41:54,180 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://flink-jobmanager-2:8081.
2018-08-29 11:41:54,192 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2018-08-29 11:41:54,273 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-08-29 11:41:54,286 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2018-08-29 11:41:54,287 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2018-08-29 11:41:54,289 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
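
A side note on the ZooKeeper lines in the log above: the client asks for sessionTimeout=60000, but the server reports negotiated timeout = 40000. A ZooKeeper server clamps the requested timeout between its minimum and maximum session timeout, which default to 2 and 20 times tickTime. The sketch below assumes tickTime=2000 ms (the actual zoo.cfg of this cluster is not shown in the thread), which would account for the 40000. The function name is made up for illustration.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_ms=None, max_ms=None):
    """Clamp a requested session timeout the way a ZooKeeper server does.

    tick_time_ms=2000 is an assumption about this cluster; the minimum and
    maximum default to 2x and 20x tickTime unless overridden.
    """
    lo = 2 * tick_time_ms if min_ms is None else min_ms
    hi = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lo, min(requested_ms, hi))


# The client in the log requested 60000 ms.
print(negotiated_timeout_ms(60000))
```

Under these assumed settings, ephemeral nodes of a jobmanager that disappears without closing its session could linger for up to roughly 40 seconds before ZooKeeper removes them.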

Upon submitting a batch job on Jobmanager-1, we immediately get this log on Jobmanager-2:
2018-08-29 11:47:06,249 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null).

Meanwhile Jobmanager-1 gets:
-FlinkBatchPipelineTranslator pipeline logs- (we use Apache Beam)

2018-08-29 11:47:06,006 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting job d69b67e4d28a2d244b06d3f6d661bca1 (sicassandrawriterbeam-flink-0829114703-7d95fabd).
2018-08-29 11:47:06,090 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Added SubmittedJobGraph(d69b67e4d28a2d244b06d3f6d661bca1, null) to ZooKeeper.

-loads of job execution info-

2018-08-29 11:49:20,272 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job d69b67e4d28a2d244b06d3f6d661bca1 reached globally terminal state FINISHED.
2018-08-29 11:49:20,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping the JobMaster for job sicassandrawriterbeam-flink-0829114703-7d95fabd(d69b67e4d28a2d244b06d3f6d661bca1).
2018-08-29 11:49:20,290 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Close ResourceManager connection 827b94881bf7c94d8516907e04e3a564: JobManager is shutting down..
2018-08-29 11:49:20,292 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Suspending SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Stopping SlotPool.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Disconnect job manager [hidden email]://flink@flink-jobmanager-1:50010/user/jobmanager_0 for job d69b67e4d28a2d244b06d3f6d661bca1 from the resource manager.
2018-08-29 11:49:20,293 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/d69b67e4d28a2d244b06d3f6d661bca1/job_manager_lock'}.
2018-08-29 11:49:20,304 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph d69b67e4d28a2d244b06d3f6d661bca1 from ZooKeeper.


-------------------

The result is:
HDFS has only a jobgraph and an empty default folder; everything else is cleared.
ZooKeeper still contains the jobgraph that Jobmanager-1 claims to have removed in its last log line.

On Wed, Aug 29, 2018 at 12:14 PM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two JobManagers/ClusterEntrypoints with us?

Running only a single JobManager/ClusterEntrypoint in HA mode via a Kubernetes Deployment should do the trick and there is nothing wrong with it.

Cheers,
Till

On Wed, Aug 29, 2018 at 11:05 AM Encho Mishinev <[hidden email]> wrote:
Hello,

Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in a YARN cluster but does not mention such an option for Kubernetes. Is there anything that can go wrong with this approach?

Thanks

On Wed, Aug 29, 2018 at 11:10 AM Encho Mishinev <[hidden email]> wrote:
Hi,

Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. 

Any ideas for workarounds are welcome, as this problem basically prevents me from using HA.

Thanks,
Encho

On Wed, Aug 29, 2018 at 5:15 AM vino yang <[hidden email]> wrote:
Hi Encho,

From your description, it sounds like there are additional bugs.

About your description:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

Does this happen every time?

In the Standalone cluster, the problems we encountered were sporadic.

Thanks, vino.

On Tue, Aug 28, 2018 at 8:07 PM Encho Mishinev <[hidden email]> wrote:
Hello Till,

I spent a few more hours testing and looking at the logs, and it seems there is a more general problem here. While both job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes brings JobManager 1 back up quickly after I manually kill it, so when I stop the job on JobManager 2 both are alive.

I did a very simple test:

- Start both job managers
- Start a batch job in JobManager 1 and let it finish
The jobgraphs in both Zookeeper and HDFS remained.

On the other hand if we do:

- Start only JobManager 1 (again in HA mode)
- Start a batch job and let it finish
The jobgraphs in both Zookeeper and HDFS are deleted fine.

It seems like the standby manager still leaves some kind of lock on the jobgraphs. Do you think that's possible? Have you seen a similar problem?
The only logs that appear on the standby manager while waiting are of the type:

2018-08-28 11:54:10,789 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(9e0a109b57511930c95d3b54574a66e3, null).

Note that this log appears on the standby jobmanager immediately when a new job is submitted to the active jobmanager.
Also note that the blobs and checkpoints are cleared fine. The problem is only for jobgraphs both in ZooKeeper and HDFS.

Trying to access the UI of the standby manager redirects to the active one, so it is not a problem of them not knowing who the leader is. Do you have any ideas?

Thanks a lot,
Encho

On Tue, Aug 28, 2018 at 10:27 AM Till Rohrmann <[hidden email]> wrote:
Hi Encho,

thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to delete the persisted job graph. So one thing to check is whether the old JM still has an open connection to ZooKeeper. The next thing to check is the session timeout of your ZooKeeper cluster. If you stop the job within the session timeout, then it is also not guaranteed that ZooKeeper has detected that the ephemeral nodes of the old JM must be deleted. In order to understand this better it would be helpful if you could tell us the timing of the different actions.
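
The mechanism described above can be sketched with a toy in-memory model. This is not a real ZooKeeper client and not Flink code; the class, the lock node names and the session labels are made up for illustration. It only shows why a parent node cannot be deleted while an ephemeral child of another, still-open session exists, and why the delete succeeds once that session has ended.

```python
class ZNodeTree:
    """Toy model of a ZooKeeper subtree (illustration only)."""

    def __init__(self):
        # path -> owning session (None means a persistent node)
        self.nodes = {}

    def create(self, path, session=None):
        self.nodes[path] = session

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return [p for p in self.nodes if p.startswith(prefix)]

    def delete(self, path):
        # Like ZooKeeper, refuse to delete a node that still has children.
        if self.children(path):
            raise RuntimeError("Directory not empty for " + path)
        del self.nodes[path]

    def expire_session(self, session):
        # When a session ends, its ephemeral nodes are removed.
        for p in [p for p, s in self.nodes.items() if s == session]:
            del self.nodes[p]


def try_delete(tree, path):
    """Return True if the delete succeeded, False if children blocked it."""
    try:
        tree.delete(path)
        return True
    except RuntimeError:
        return False


tree = ZNodeTree()
job = "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15"
tree.create(job)
tree.create(job + "/lock-old-jm", session="old-jm")  # hypothetical lock names
tree.create(job + "/lock-new-jm", session="new-jm")

tree.delete(job + "/lock-new-jm")   # the new leader releases its own lock
print(try_delete(tree, job))        # blocked by the old JM's ephemeral node
tree.expire_session("old-jm")       # old session closed or timed out
print(try_delete(tree, job))        # now the job graph node can be removed
```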

Cheers,
Till

On Tue, Aug 28, 2018 at 8:17 AM vino yang <[hidden email]> wrote:
Hi Encho,

As a temporary workaround, you can check whether a job has been cleaned up by monitoring its JobID under Zookeeper's "/jobgraph" path.
Another option is to modify the source code and forcibly make the cleanup synchronous. However, Flink has to obtain the corresponding lock before it operates on a Zookeeper path, so this is dangerous and not recommended.
I think this problem may be solved in the next version. It depends on Till.
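
The monitoring workaround could look roughly like the sketch below. It does not talk to ZooKeeper itself: `list_children` is a hypothetical callable you would supply (for example a wrapper around whatever ZooKeeper client you use) that returns the child names of the jobgraphs node, which in the ZooKeeper logs of this thread is /flink/default/jobgraphs. The function names and defaults are assumptions for illustration.

```python
import time


def leftover_job_graphs(zk_children, active_job_ids):
    """Job IDs that still have a jobgraph node but are no longer active."""
    return sorted(set(zk_children) - set(active_job_ids))


def wait_until_removed(list_children, job_id, attempts=10, delay_s=1.0):
    """Poll until job_id disappears from the listing or attempts run out.

    list_children is a caller-supplied function (hypothetical here) that
    returns the current child names of the jobgraphs node.
    """
    for _ in range(attempts):
        if job_id not in list_children():
            return True
        time.sleep(delay_s)
    return False


# Example with a fake listing instead of a live ZooKeeper connection.
children = ["83bfa359ca59ce1d4635e18e16651e15",
            "d69b67e4d28a2d244b06d3f6d661bca1"]
print(leftover_job_graphs(children, ["83bfa359ca59ce1d4635e18e16651e15"]))
```

If `wait_until_removed` returns False after a job has finished, the job graph was left behind and needs manual attention.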

Thanks, vino.

On Tue, Aug 28, 2018 at 1:17 PM Encho Mishinev <[hidden email]> wrote:
Thank you very much for the info! Will keep track of the progress. 

In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug.

On Tue, Aug 28, 2018 at 4:52 AM vino yang <[hidden email]> wrote:
Some background on the implementation:
Flink uses Zookeeper to store the JobGraph (the job's description and metadata) as the basis for job recovery.
However, earlier implementations may fail to clean this information up properly because it is deleted asynchronously by a background thread.

Thanks, vino.

On Tue, Aug 28, 2018 at 9:49 AM vino yang <[hidden email]> wrote:
Hi Encho,

This is a problem already known to the Flink community. You can track its progress through FLINK-10011[1]; Till is currently fixing this issue.


Thanks, vino.

On Mon, Aug 27, 2018 at 10:13 PM Encho Mishinev <[hidden email]> wrote:
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode.

My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2 checkpoints to succeed
- Kill pod of jobmanager-1
After a short delay, jobmanager-2 takes leadership and correctly restores the job and continues it
- Stop job from jobmanager-2

At this point all seems well, but the problem is that jobmanager-2 does not clean up anything that was left from jobmanager-1. This means that job graphs remain in both HDFS and Zookeeper. They later obstruct the work of both managers: after any reset, the managers try to restore a non-existent job and fail over and over again.

I am quite certain that jobmanager-2 does not know about any of jobmanager-1’s files since the Zookeeper logs reveal that it tries to duplicate job folders:

2018-08-27 13:11:00,038 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x46 zxid:0x1ab txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77 Error:KeeperErrorCode = NodeExists for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15/bbb259fd-7826-4950-bc7c-c2be23346c77

2018-08-27 13:11:02,296 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:create cxid:0x5c zxid:0x1ac txntype:-1 reqpath:n/a Error Path:/flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = NodeExists for /flink/default/checkpoint-counter/83bfa359ca59ce1d4635e18e16651e15

Also jobmanager-2 attempts to delete the jobgraphs folder in Zookeeper when the job is stopped, but fails since there are leftover files in it from jobmanager-1:

2018-08-27 13:12:13,406 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd txntype:-1 reqpath:n/a Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 Error:KeeperErrorCode = Directory not empty for /flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15
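
To compare many such ZooKeeper server log lines, it can help to pull out the operation, path and error code. The sketch below is a small regex written against the three lines quoted in this message only; other ZooKeeper versions may format these lines differently, and the function name is made up.

```python
import re

# Matches the "Got user-level KeeperException" lines quoted above.
KEEPER_ERROR = re.compile(
    r"type:(?P<op>\w+) .*?Error Path:(?P<path>\S+) "
    r"Error:KeeperErrorCode = (?P<code>.+?) for "
)


def parse_keeper_error(line):
    """Return {'op', 'path', 'code'} for a matching line, else None."""
    m = KEEPER_ERROR.search(line)
    return m.groupdict() if m else None


line = ("sessionid:0x1657aa15e480033 type:delete cxid:0xa8 zxid:0x1bd "
        "txntype:-1 reqpath:n/a "
        "Error Path:/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15 "
        "Error:KeeperErrorCode = Directory not empty for "
        "/flink/default/jobgraphs/83bfa359ca59ce1d4635e18e16651e15")
print(parse_keeper_error(line))
```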

I’ve noticed that when restoring the job, it seems like jobmanager-2 does not get anything more than jobID, while it perhaps needs some metadata? Here is the log that seems suspicious to me:

2018-08-27 13:09:18,113 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(83bfa359ca59ce1d4635e18e16651e15, null).

All other logs seem fine in jobmanager-2; it doesn’t seem to be aware that it’s overwriting anything or not deleting properly.

My question is - what is the intended way for the job managers to correctly exchange metadata in HA mode and why is it not working for me?

Thanks in advance!

Re: JobGraphs not cleaned up in HA mode

seuzxc
Hi, I have the same problem with Flink 1.9.1. Is there any solution to fix it?
When k8s redeploys the jobmanager, the error looks like this (it seems ZooKeeper
did not remove the submitted job info, but the jobmanager removed the file):


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar
The following conditions are mandatory for running in HA mode:

a) You should have a persistent, common external store that the jobmanager and the task managers use when writing state.
b) You should have a persistent external store for Zookeeper to store the JobGraph.

Zookeeper refers to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the jobmanager is unable to find it.
It seems /flink/checkpoints is not the external persistent store.


Regards
Bhaskar


Re: JobGraphs not cleaned up in HA mode

seuzxc
/flink/checkpoints is an external persistent store (a NAS directory mounted into the job manager)




------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午2:29
收件人: "曾祥才"<[hidden email]>;
抄送: "user"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Following are the mandatory condition to run in HA:

a) You should have persistent common external store for jobmanager and task managers to while writing the state
b) You should have persistent external store for zookeeper to store the Jobgraph.

Zookeeper is referring  path: /flink/checkpoints/submittedJobGraph480ddf9572ed  to get the job graph but jobmanager unable to find it.
It seems /flink/checkpoints  is not the external persistent store


Regards
Bhaskar

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <[hidden email]> wrote:
hi ,I've the same problem with flink 1.9.1 , any solution to fix it
when the k8s redoploy jobmanager ,  the error looks like (seems zk not
remove submitted job info, but jobmanager remove the file): 


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)



Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar
Is it a plain filesystem or Hadoop? If it is NAS, then why does the exception say "Caused by: org.apache.hadoop.hdfs.BlockMissingException"?
It seems you configured a Hadoop state store but are giving it a NAS mount.

Regards
Bhaskar
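
One way to approach the filesystem-or-Hadoop question is to look at the scheme of the configured storage path. Below is a minimal Python sketch, assuming the path is available as a URI string. The helper storage_backend and the example URIs other than /flink/checkpoints are hypothetical. A path without a scheme is resolved by Flink against its default filesystem, so the path alone does not say whether HDFS or a mounted directory ends up being used.

```python
from urllib.parse import urlparse


def storage_backend(path):
    """Classify a storage path by its URI scheme (illustrative labels only)."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "hdfs":
        return "hdfs"
    if scheme == "file":
        return "local-or-mounted"
    if scheme == "":
        # No scheme given: the effective filesystem depends on Flink's
        # default-filesystem setting, so this case is ambiguous.
        return "default-filesystem"
    return "other:" + scheme


for p in (
    "/flink/checkpoints",
    "file:///flink/checkpoints",
    "hdfs:///flink/checkpoints",
):
    print(p, "->", storage_backend(p))
```

If a NAS mount is intended, writing the path with an explicit file:// scheme removes the ambiguity of the scheme-less form. This is only a suggestion and was not confirmed as the fix in this thread.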

 
