Best way to find the current alive jobmanager with HA mode zookeeper

Best way to find the current alive jobmanager with HA mode zookeeper

Yuan,Youjun

Hi all,

 

I have a standalone cluster with 3 JobManagers, with high-availability set to zookeeper. Our client submits jobs through the REST API (POST /jars/:jarid/run), which means we need to know the host of one of the currently alive JobManagers. The problem is: how can we know which JobManager is alive, or the host of the current leader? We don't want to access a dead JM.

 

Thanks.

Youjun Yuan

Re: Best way to find the current alive jobmanager with HA mode zookeeper

vino yang
Hi Yuan Youjun,

Actually, RestClusterClient has a method named getWebMonitorBaseUrl, which automatically retrieves the web monitor leader's address when you submit a job.[1]

Ideally, you do not need to retrieve the JM yourself. Currently the web monitor is bound to the JobManager, so perhaps after a JM failover you cannot find the new web monitor?

Flink provides a component named "LeaderRetrievalService" to retrieve the leader of many components. For ZooKeeper, there is an implementation named "ZooKeeperLeaderRetrievalService".

ZooKeeperHaServices provides a method named "getWebMonitorLeaderRetriever" to retrieve the web monitor's leader and a method named "getJobManagerLeaderRetriever" to retrieve the JobManager's leader. ClusterClient#getJobManagerGateway uses it.


Thanks, vino.


In reply to the post by Yuan,Youjun of 2018-07-25 11:37 GMT+08:00.

Re: Best way to find the current alive jobmanager with HA mode zookeeper

Martin Eden
In reply to this post by Yuan,Youjun
Hi,

This is actually very relevant to us as well.

We want to deploy Flink 1.3.2 on a 3-node DCOS cluster. On Mesos/DCOS, Flink HA runs only one JobManager, which Marathon restarts on another node in case of failure and which then reloads its state from ZooKeeper.

Yuan, I am guessing you are using Flink in standalone mode, which actually runs 3 instances of the JobManager: 1 active and 2 standbys.

Either way, in both cases you need to "discover" the hostname and port of the JobManager at runtime, for instance when you want to use the CLI to submit jobs. Is there an elegant way to submit jobs other than, say, just trying out all the possible nodes in your cluster?

Grateful if anyone could clarify any of the above, thanks,
M


Re: Best way to find the current alive jobmanager with HA mode zookeeper

vino yang
Hi Martin,


For a standalone cluster with multiple JM instances, if you use the cluster client that Flink provides instead of the REST API, the client can determine which of the JM instances is the leader.

For example, you can use the CLI to submit a Flink job from a non-leader node.

But I did not verify this for Flink on Mesos.

Thanks, vino.

In reply to the post by Martin Eden of 2018-07-25 17:22 GMT+08:00.

Re: Best way to find the current alive jobmanager with HA mode zookeeper

Yuan,Youjun

Thanks for the information. I forgot to mention that I am using Flink 1.4, where the RestClusterClient does not seem to be able to retrieve the leader address. I did notice there is a webMonitorRetrievalService member in Flink 1.5.

I wonder if I can use [hidden email] on my client side to retrieve the leader JM of a Flink 1.4 cluster.

 

Thanks

Youjun

 

In reply to the post by vino yang of Wednesday, July 25, 2018 7:11 PM.

Re: Re: Best way to find the current alive jobmanager with HA mode zookeeper

vino yang
Hi Youjun,

Thanks. You can try this, but I am not sure it will work correctly, because the REST client changed quite a bit from 1.4 to 1.5.

Maybe you can customize the 1.4 source code by referring to the 1.5 implementation? Another option is to upgrade your Flink version.

To Chesnay and Till: any suggestions or opinions?

Thanks, vino.

In reply to the post by Yuan,Youjun of 2018-07-26 10:01 GMT+08:00.

Re: Re: Best way to find the current alive jobmanager with HA mode zookeeper

Till Rohrmann
I think that the web UI automatically redirects to the current leader. So if you access a JobManager which is not the leader, you should get an HTTP redirect to the current leader. Because of that, it should not be strictly necessary to know which of the JobManagers is the leader.
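
As a rough illustration of the fallback described above (trying the known JobManagers and letting an HTTP redirect point at the leader), here is a minimal Python sketch. It is not taken from Flink: the host names, port 8081, the `/overview` probe path, and all function names are assumptions made for the example, and whether a non-leader really answers with a redirect depends on your Flink version.

```python
import http.client
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(base_url: str, path: str = "/overview", timeout: float = 3.0) -> Optional[str]:
    """Return the base URL that finally answered (after any redirects), or None.

    The probe path is an assumption; use any cheap GET endpoint your cluster serves.
    """
    url = base_url.rstrip("/") + path
    try:
        # urlopen follows HTTP redirects for GET requests by default.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            final_url = resp.geturl()
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error status, ...
        return None
    if final_url.endswith(path):
        final_url = final_url[: -len(path)]
    return final_url.rstrip("/")


def find_alive_jobmanager(
    candidates: Iterable[str],
    probe: Callable[[str], Optional[str]] = http_probe,
) -> Optional[str]:
    """Try each candidate in order and return the first base URL that responds."""
    for base_url in candidates:
        resolved = probe(base_url)
        if resolved is not None:
            return resolved
    return None


def run_url(base_url: str, jar_id: str) -> str:
    """Build the URL for POST /jars/:jarid/run against the resolved JobManager."""
    return f"{base_url.rstrip('/')}/jars/{urllib.parse.quote(jar_id, safe='')}/run"
```

A client would call `find_alive_jobmanager(["http://jm1:8081", "http://jm2:8081", "http://jm3:8081"])` with its own host list and then POST to `run_url(...)`. The probe is passed in as a parameter so the selection logic can be exercised without a running cluster.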

The RestClusterClient uses the ZooKeeperLeaderRetrievalService to retrieve the leader address. You could try the same. Using the RestClusterClient with Flink 1.4 won't work, though. Alternatively, you should be able to directly read the address from the leader ZNode in ZooKeeper.
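
For the last suggestion, reading the leader ZNode directly, the following Python sketch decodes the ZNode payload once its bytes have been fetched with a ZooKeeper client of your choice. It assumes, and this should be checked against the Flink version in use, that the payload is a Java `ObjectOutputStream` stream whose first item is the leader address written with `writeUTF`, followed by a serialized session ID that the sketch ignores. The function name and the sample address in the usage note are made up for the example.

```python
import struct

# Constants from the Java Object Serialization Stream Protocol.
STREAM_MAGIC = 0xACED
STREAM_VERSION = 5
TC_BLOCKDATA = 0x77       # block with a 1-byte length
TC_BLOCKDATALONG = 0x7A   # block with a 4-byte length


def decode_leader_address(payload: bytes) -> str:
    """Extract the first writeUTF string from a Java serialization stream.

    Assumption: the leader address is the first item in the stream and fits
    in a single data block. Anything after that block is ignored.
    """
    if len(payload) < 6:
        raise ValueError("payload too short")
    magic, version = struct.unpack(">HH", payload[:4])
    if (magic, version) != (STREAM_MAGIC, STREAM_VERSION):
        raise ValueError("not a Java serialization stream")

    tag = payload[4]
    if tag == TC_BLOCKDATA:
        block_len, start = payload[5], 6
    elif tag == TC_BLOCKDATALONG:
        if len(payload) < 9:
            raise ValueError("truncated block header")
        block_len, start = struct.unpack(">I", payload[5:9])[0], 9
    else:
        raise ValueError("stream does not start with a data block")

    block = payload[start:start + block_len]
    if len(block) < 2:
        raise ValueError("truncated data block")
    # writeUTF emits a 2-byte big-endian length followed by the string bytes.
    utf_len = struct.unpack(">H", block[:2])[0]
    raw = block[2:2 + utf_len]
    if len(raw) != utf_len:
        raise ValueError("truncated string")
    # Java's modified UTF-8 matches UTF-8 for the ASCII text expected here.
    return raw.decode("utf-8")
```

Given bytes that encode a hypothetical address such as `akka.tcp://flink@jm1:6123/user/jobmanager`, the function returns that string. The ZNode path itself depends on your HA configuration and is deliberately left out of the sketch.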

Cheers,
Till



In reply to the post by vino yang of Thu, Jul 26, 2018 at 4:14 AM.

Re: Re: Best way to find the current alive jobmanager with HA mode zookeeper

Martin Eden
Hi guys,

Just to close the loop: with the Flink 1.3.2 CLI you have to provide the Flink JobManager host address in order to submit a job, like so:
${FLINK_HOME}/bin/flink run -d -m ${FLINK_JOBMANAGER_ADDRESS} ${JOB_JAR}

Since we are running the DCOS Flink package, we use the Marathon REST API to fetch the FLINK_JOBMANAGER_ADDRESS, which solved our problem.

We are now thinking of upgrading to the latest 1.6 release. From the CLI docs and the previous messages, it seems you still need to provide the JobManager address explicitly. Are there any plans to support job submission that takes just a ZooKeeper ensemble and zookeeperNamespace (which is currently accepted), without having to provide an explicit JobManager address? This would be more user-friendly and would eliminate the extra step of figuring out the JobManager address.

Thanks,
M



In reply to the post by Till Rohrmann of Tue, Jul 31, 2018 at 3:54 PM.


Re: Re: Best way to find the current alive jobmanager with HA mode zookeeper

Till Rohrmann
Hi Martin,

When Flink is configured to use the ZooKeeper HA mode, it is not necessary to specify the leader's address manually. The CLI will ask ZooKeeper for the leader information and send the request to the current leader. This should work with at least Flink >= 1.5 and also with Flink 1.4.
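
To make the difference concrete, here is a small Python sketch of a wrapper that builds the `flink run` command line both ways: with an explicit `-m` address, as in the 1.3.2 command quoted in this thread, and without it, for a client whose Flink configuration already points at ZooKeeper. The function name, paths, and addresses are hypothetical. Only the `run`, `-d`, and `-m` arguments come from the thread.

```python
from typing import List, Optional


def build_flink_run_command(
    flink_home: str,
    job_jar: str,
    jobmanager_address: Optional[str] = None,
    detached: bool = True,
) -> List[str]:
    """Build the argument list for `flink run`.

    If jobmanager_address is None, no -m option is emitted and the CLI is
    left to find the leader through its own HA configuration.
    """
    cmd = [f"{flink_home.rstrip('/')}/bin/flink", "run"]
    if detached:
        cmd.append("-d")
    if jobmanager_address:
        cmd += ["-m", jobmanager_address]
    cmd.append(job_jar)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing the arguments as a list rather than as one shell string avoids quoting problems with paths that contain spaces.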

Cheers,
Till  

In reply to the post by Martin Eden of Tue, Aug 21, 2018 at 10:20 AM.