Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

chiggi_dev
Hi,

I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 

JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
 Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f


But TaskManagers are not able to register with the JobManager and gives the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 

One interesting thing I observed was a ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:

Flink Version : 1.4.0
ZK version : 3.4.11 (I just pulled the latest image)

Thanks in advance. 

Chirag

Reply | Threaded
Open this post in threaded view
|

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Aljoscha Krettek
Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.

Best,
Aljoscha

On 14. Feb 2018, at 06:12, Chirag Dewan <[hidden email]> wrote:

Hi,

I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 

JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
JobManager <a href="akka.tcp://flink@jobmanager:6123/user/jobmanager" class="">akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader <a href="akka.tcp://flink@jobmanager:6123/user/jobmanager" class="">akka.tcp://flink@jobmanager:6123/user/jobmanager
 Resource Manager associating with leading JobManager Actor[<a href="akka://flink/user/jobmanager#590681231" class="">akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f


But TaskManagers are not able to register with the JobManager and gives the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 

One interesting thing I observed was a ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:

Flink Version : 1.4.0
ZK version : 3.4.11 (I just pulled the latest image)

Thanks in advance. 

Chirag


Reply | Threaded
Open this post in threaded view
|

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

chiggi_dev
Thanks Aljoscha.

I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?

Regards,

Chirag


On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek
Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.

Best,
Aljoscha

On 14. Feb 2018, at 06:12, Chirag Dewan <[hidden email]> wrote:

Hi,

I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 

JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
 Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f


But TaskManagers are not able to register with the JobManager and gives the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 

One interesting thing I observed was a ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:

Flink Version : 1.4.0
ZK version : 3.4.11 (I just pulled the latest image)

Thanks in advance. 

Chirag


Reply | Threaded
Open this post in threaded view
|

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Aljoscha Krettek
It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode

On 14. Feb 2018, at 15:32, Chirag Dewan <[hidden email]> wrote:

Thanks Aljoscha.

I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?

Regards,

Chirag


On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek
Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.

Best,
Aljoscha

On 14. Feb 2018, at 06:12, Chirag Dewan <[hidden email]> wrote:

Hi,

I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 

JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
 Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f


But TaskManagers are not able to register with the JobManager and gives the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 

One interesting thing I observed was a ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:

Flink Version : 1.4.0
ZK version : 3.4.11 (I just pulled the latest image)

Thanks in advance. 

Chirag



Reply | Threaded
Open this post in threaded view
|

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

chiggi_dev
Thanks a lot Aljoscha.

I was doing a silly mistake. TaskManagers can now register with JobManager.

One more thing, does Flink now store Job Graphs on ZK too?

Regards,

Chirag

On Wednesday, 14 February, 2018, 8:06:14 PM IST, Aljoscha Krettek <[hidden email]> wrote:


It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode

On 14. Feb 2018, at 15:32, Chirag Dewan <[hidden email]> wrote:

Thanks Aljoscha.

I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?

Regards,

Chirag


On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek
Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.

Best,
Aljoscha

On 14. Feb 2018, at 06:12, Chirag Dewan <[hidden email]> wrote:

Hi,

I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 

JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
 Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f


But TaskManagers are not able to register with the JobManager and gives the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 

One interesting thing I observed was a ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:

Flink Version : 1.4.0
ZK version : 3.4.11 (I just pulled the latest image)

Thanks in advance. 

Chirag



Reply | Threaded
Open this post in threaded view
|

Re: Deploying Flink with JobManager HA on Docker Swarm/Kubernetes

Aljoscha Krettek
Hi,

AFAIK, the JobGraph itself is not stored in ZK but in HDFS. ZK only stores a handle to the serialised JobGraph.

Best,
Aljoscha

On 15. Feb 2018, at 04:59, Chirag Dewan <[hidden email]> wrote:

Thanks a lot Aljoscha.

I was doing a silly mistake. TaskManagers can now register with JobManager.

One more thing, does Flink now store Job Graphs on ZK too?

Regards,

Chirag

On Wednesday, 14 February, 2018, 8:06:14 PM IST, Aljoscha Krettek <[hidden email]> wrote:


It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode

On 14. Feb 2018, at 15:32, Chirag Dewan <[hidden email]> wrote:

Thanks Aljoscha.

I haven't checked that bit. Is there any configuration for TaskManagers to find ZK?

Regards,

Chirag


On Wed, 14 Feb 2018 at 7:43 PM, Aljoscha Krettek
Do you see in the logs whether the TaskManager correctly connect to ZooKeeper as well? They need this in order to find the JobManager leader.

Best,
Aljoscha

On 14. Feb 2018, at 06:12, Chirag Dewan <[hidden email]> wrote:

Hi,

I am trying to deploy a Flink cluster (1 JM, 2TM) on a Docker Swarm. For JobManager HA, I have started a 3 node zookeeper service on the same swarm network and configured Flink's zookeeper quorum with zookeeper service instances. 

JobManager gets started with the LeaderElectionService and gets assigned a LeaderSessionID too, which I can see from the following log statements(attaching only related logs) :

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService   org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
 Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
 Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f


But TaskManagers are not able to register with the JobManager and gives the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Seems like the ResourceManager was not able to retrieve the LeaderSessionID and passed 00 ID. 

One interesting thing I observed was a ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:

Flink Version : 1.4.0
ZK version : 3.4.11 (I just pulled the latest image)

Thanks in advance. 

Chirag