How to restart/recover on reboot?


John Smith
The installation instructions do not indicate how to create systemd services.

1- When task nodes fail, will the job leader detect this and SSH in to restart the task node? From my testing it doesn't seem like it.
2- How do we recover a lost node? Do we simply go back to the master node and run start-cluster.sh, and is the script smart enough to figure out what is missing?
3- Or do we need to create systemd services, and if so, which command should the service run?

Re: How to restart/recover on reboot?

John Smith
I looked into start-cluster.sh and I don't see anything special. So technically it should be as easy as installing systemd services to run jobmanager.sh and taskmanager.sh respectively?


Re: How to restart/recover on reboot?

Till Rohrmann
Hi John,

I don't have much experience with setting up Flink via systemd services. Why do you want to do it that way?

1. In standalone mode, Flink won't automatically restart TaskManagers. This only works on Yarn and Mesos at the moment.
2. In case of a lost TaskManager, you should run `taskmanager.sh start`. This script simply starts a new TaskManager process.
3. I guess you could use systemd to bring up a Flink TaskManager process on startup.

Cheers,
Till
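
A minimal unit along the lines of point 3 might look like the sketch below. The install path /opt/flink, the flink user, and the use of start-foreground (which keeps the JVM attached so systemd can supervise and restart it) are assumptions, not something stated in the thread or the Flink docs:

    # /etc/systemd/system/flink-taskmanager.service (sketch; paths and user are assumptions)
    [Unit]
    Description=Apache Flink TaskManager
    After=network.target

    [Service]
    Type=simple
    User=flink
    # start-foreground keeps the TaskManager attached so systemd can supervise it
    ExecStart=/opt/flink/bin/taskmanager.sh start-foreground
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target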


Re: How to restart/recover on reboot?

John Smith
Well, a few reasons: machine reboots/maintenance, host/VM crashes and restarts, etc. And the same goes for the job manager. I don't want to have to document/remember some manual start process for sysadmins/devops.

So far I have looked at ./start-cluster.sh and all it seems to do is SSH into all the specified nodes and start the processes using the jobmanager and taskmanager scripts. I don't see anything special in any of the sh scripts.
I configured passwordless SSH through Terraform and all of that works great; it's only the manual start through systemd that isn't working. I may be missing something...




Re: How to restart/recover on reboot?

Till Rohrmann
When a single machine fails, you should rather call `taskmanager.sh start` / `jobmanager.sh start` to start a single process. `start-cluster.sh` will start multiple processes on different machines.

Cheers,
Till
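
As a rough sketch of the difference (assuming a tarball install under /opt/flink, which is an assumption on my part):

    # On the recovered TaskManager host only:
    /opt/flink/bin/taskmanager.sh start

    # On a recovered JobManager host only:
    /opt/flink/bin/jobmanager.sh start

    # From the designated master only, to SSH out and start every node listed in conf/masters and conf/slaves:
    /opt/flink/bin/start-cluster.sh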


Re: How to restart/recover on reboot?

John Smith
Yes, that is understood. But I don't see why we cannot call jobmanager.sh and taskmanager.sh to build the cluster and have them run as systemd units.

I looked at start-cluster.sh and all it does is SSH and call jobmanager.sh, which then cascades to taskmanager.sh. I just have to pinpoint what's missing to get the systemd service working. In fact, calling jobmanager.sh as a systemd service actually sees the shared masters, slaves and flink-conf.yaml. But it binds to localhost.

Maybe one way to do it would be to bootstrap the cluster with ./start-cluster.sh and then install systemd services for jobmanager.sh and taskmanager.sh.

Like I said, I don't want to have some process in place to remind admins that they need to manually start a node every time they patch or a host goes down for whatever reason.


Re: How to restart/recover on reboot?

Till Rohrmann
I guess it should work if you installed a systemd service which simply calls `jobmanager.sh start` or `taskmanager.sh start`.

Cheers,
Till 


Re: [EXTERNAL] Re: How to restart/recover on reboot?

PoolakkalMukkath, Shakir

Hi Till, John,

I do agree with the issue John mentioned and have the same problem.

We can only start a standalone HA cluster with the ./start-cluster.sh script. Then, when there are failures, we can restart those components individually by calling jobmanager.sh / taskmanager.sh. This works great.

But, like John mentioned, if we want to start the cluster initially by running jobmanager.sh on each JobManager node, it does not work. It binds to localhost and does not form the HA cluster.

 

Thanks,

Shakir

 


RE: [EXTERNAL] Re: How to restart/recover on reboot?

Martin, Nick-2

jobmanager.sh takes an optional argument for the hostname to bind to, and start-cluster.sh uses it. If you leave it blank, the script will use whatever is in flink-conf.yaml (localhost is the default value that ships with Flink).

 

The dockerized version of Flink runs pretty much the way you're trying to operate (i.e. each node starts itself), so the entrypoint script from that image is probably a good source of information about how to set it up.
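
A hedged sketch of a unit file along those lines; the path, the flink user, and the use of systemd's %H specifier (which expands to the machine's hostname) are assumptions, and the exact arguments accepted by jobmanager.sh should be checked against your Flink version:

    # /etc/systemd/system/flink-jobmanager.service (sketch; paths, user and arguments are assumptions)
    [Unit]
    Description=Apache Flink JobManager
    After=network.target

    [Service]
    Type=simple
    User=flink
    # Pass the local hostname so the JobManager binds to this machine instead of the
    # localhost default from flink-conf.yaml
    ExecStart=/opt/flink/bin/jobmanager.sh start-foreground %H
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target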

 


Re: [EXTERNAL] Re: How to restart/recover on reboot?

PoolakkalMukkath, Shakir

Hi Nick,

 

It works that way by explicitly setting the host argument. I got misled by the word "only" in the docs and did not try it. Thanks for the help.

 

Thanks,

Shakir

From: "Martin, Nick" <[hidden email]>
Date: Tuesday, June 18, 2019 at 6:31 PM
To: "PoolakkalMukkath, Shakir" <[hidden email]>, Till Rohrmann <[hidden email]>, John Smith <[hidden email]>
Cc: user <[hidden email]>
Subject: RE: [EXTERNAL] Re: How to restart/recover on reboot?

 

Jobmanager.sh takes an optional argument for the hostname to bind to, and start-cluster uses it. If you leave it blank it, the script will use whatever is in flink-conf.yaml (localhost is the default value that ships with flink).

 

The dockerized version of flink runs pretty much the way you’re trying to operate (i.e. each node starts itself), so the entrypoint script out of that is probably a good source of information about how to set it up.

 

From: PoolakkalMukkath, Shakir [mailto:[hidden email]]
Sent: Tuesday, June 18, 2019 2:15 PM
To: Till Rohrmann <[hidden email]>; John Smith <[hidden email]>
Cc: user <[hidden email]>
Subject: EXT :Re: [EXTERNAL] Re: How to restart/recover on reboot?

 

Hi Tim,John,

 

I do agree with the issue John mentioned and have the same problem.

 

We can only start a standalone HA cluster with ./start-cluster.sh script. And then when there are failures, we can restart those components individually by calling jobmanager.sh/ jobmanager.sh.  This works great

But , Like John mentioned, If we want to start the cluster initially itself by running the jobmanager.sh on each JobManager nodes, it is not working. It binds to local and not forming the HA cluster.

 

Thanks,

Shakir

 

From: Till Rohrmann <[hidden email]>
Date: Tuesday, June 18, 2019 at 4:23 PM
To: John Smith <[hidden email]>
Cc: user <[hidden email]>
Subject: [EXTERNAL] Re: How to restart/recover on reboot?

 

I guess it should work if you installed a systemd service which simply calls `jobmanager.sh start` or `taskmanager.sh start`.

 

Cheers,

Till 

 

On Tue, Jun 18, 2019 at 4:29 PM John Smith <[hidden email]> wrote:

Yes, that is understood. But I don't see why we cannot call jobmanager.sh and taskmanager.sh to build the cluster and have them run as systemd units.

I looked at start-cluster.sh and all it does is SSH and call jobmanager.sh which then cascades to taskmanager.sh I just have to pin point what's missing to have systemd service working. In fact calling jobmanager.sh as systemd service actually sees the shared masters, slaves and flink-conf.yaml. But it binds to local host. 

 

Maybe one way to do it would be to bootstrap the cluster with ./start-cluster.sh and then install systemd services for jobmanager.sh and tsakmanager.sh

 

Like I said I don't want to have some process in place to remind admins they need to manually start a node every time they patch or a host goes down for what ever reason.

 

On Tue, 18 Jun 2019 at 04:31, Till Rohrmann <[hidden email]> wrote:

When a single machine fails you should rather call `taskmanager.sh start`/`jobmanager.sh start` to start a single process. `start-cluster.sh` will start multiple processes on different machines.

 

Cheers,

Till

 

On Mon, Jun 17, 2019 at 4:30 PM John Smith <[hidden email]> wrote:

Well some reasons, machine reboots/maintenance etc... Host/VM crashes and restarts. And same goes for the job manager. I don't want/need to have to document/remember some start process for sys admins/devops.

So far I have looked at ./start-cluster.sh and all it seems to do is SSH into all the specified nodes and starts the processes using the jobmanager and taskmanager scripts. I don't see anything special in any of the sh scripts.
I configured passwordless ssh through terraform and all that works great only when trying to do the manual start through systemd. I may have something missing...

 

On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <[hidden email]> wrote:

Hi John,

 

I have not much experience wrt setting Flink up via systemd services. Why do you want to do it like that?

 

1. In standalone mode, Flink won't automatically restart TaskManagers. This only works on Yarn and Mesos atm.

2. In case of a lost TaskManager, you should run `taskmanager.sh start`. This script simply starts a new TaskManager process.

3. I guess you could use systemd to bring up a Flink TaskManager process on start up.

 

Cheers,

Till

 

On Fri, Jun 14, 2019 at 5:56 PM John Smith <[hidden email]> wrote:

I looked into the start-cluster.sh and I don't see anything special. So technically it should be as easy as installing Systemd services to run jobamanger.sh and taskmanager.sh respectively?

 

On Wed, 12 Jun 2019 at 13:02, John Smith <[hidden email]> wrote:

The installation instructions do not indicate how to create systemd services.

 

1- When task nodes fail, will the job leader detect this and ssh and restart the task node? From my testing it doesn't seem like it.

2- How do we recover a lost node? Do we simply go back to the master node and run start-cluster.sh and the script is smart enough to figure out what is missing?

3- Or do we need to create systemd services and if so on which command do we start the service on?

 


Notice: This e-mail is intended solely for use of the individual or entity to which it is addressed and may contain information that is proprietary, privileged and/or exempt from disclosure under applicable law. If the reader is not the intended recipient or agent responsible for delivering the message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. This communication may also contain data subject to U.S. export laws. If so, data subject to the International Traffic in Arms Regulation cannot be disseminated, distributed, transferred, or copied, whether incorporated or in its original form, to foreign nationals residing in the U.S. or abroad, absent the express prior approval of the U.S. Department of State. Data subject to the Export Administration Act may not be disseminated, distributed, transferred or copied contrary to U. S. Department of Commerce regulations. If you have received this communication in error, please notify the sender by reply e-mail and destroy the e-mail message and any physical copies made of the communication.
 Thank you. 
*********************

Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Re: How to restart/recover on reboot?

John Smith
Ah ok, we need to pass the host. The command-line help says jobmanager.sh <host>, if I recall. I have to go check tomorrow...
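
For reference, the usage line is roughly the following, with the host as a positional argument rather than a --host flag (worth double-checking against your Flink version):

    jobmanager.sh ((start|start-foreground) [host] [webui-port])|stop|stop-all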


Re: [EXTERNAL] Re: How to restart/recover on reboot?

John Smith
Ok, I tried it and it works! I can set up my cluster with Terraform and enable systemd services! I think I got confused earlier when I looked and it was doing leader election, because all the services came up quickly!
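
For completeness, the units then just need to be enabled so they come back after a reboot (the unit names here match the earlier sketches and are assumptions):

    sudo systemctl daemon-reload
    sudo systemctl enable --now flink-jobmanager.service    # on JobManager nodes
    sudo systemctl enable --now flink-taskmanager.service   # on TaskManager nodes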


