How to get automatic fail over working in Flink

How to get automatic fail over working in Flink

James Isaac
This question has been asked on StackOverflow:
https://stackoverflow.com/questions/48262080/how-to-get-automatic-fail-over-working-in-flink

I am using Apache Flink 1.4 on a cluster of 3 machines, out of which one is the JobManager and the other 2 host TaskManagers.

I start Flink in cluster mode and submit a Flink job. I have configured 24 task slots in the Flink config, and the job uses 6 task slots.

When I submit the job, I see that 3 tasks are assigned to worker machine 1 and 3 are assigned to worker machine 2. Now, when I kill the TaskManager on worker machine 2, the entire job fails.

Is this the expected behaviour, or does Flink have automatic failover as Spark does?

Do we need to use YARN/Mesos to achieve automatic failover?

We tried the restart strategy, but when the job restarts we get an exception saying that no task slots are available, and then the job fails. We think that 24 slots should be enough to take over the failed tasks. What could we be doing wrong here?
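For reference, this is roughly what a fixed-delay restart strategy looks like when set on the execution environment in Flink 1.4 — a minimal sketch, not the actual job from this thread: only the 3 restart attempts and the parallelism of 6 come from the description above; the 10-second delay and the placeholder pipeline are assumptions.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailoverExampleJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Parallelism 6 -> the job needs 6 task slots (observed as 3 per machine in this thread).
        env.setParallelism(6);

        // Retry the whole job up to 3 times, waiting 10 seconds between attempts,
        // before declaring it failed. (The delay value is an assumption.)
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // The real topology goes here; a trivial placeholder pipeline:
        env.fromElements(1, 2, 3).print();

        env.execute("failover-test");
    }
}
```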

Regards,
James

Re: How to get automatic fail over working in Flink

Nico Kruber
Hi James,
In this scenario, with the restart strategy set, the job should restart
(without YARN/Mesos) as long as you have enough slots available.

Can you check with the web interface on http://<jobmanager>:8081/ that
enough slots are available after killing one TaskManager?
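The numbers shown in the web UI can also be fetched as JSON for a before/after comparison. A small sketch, assuming the Flink 1.4 JobManager web frontend exposes an /overview endpoint on port 8081; "jobmanager-host" is a placeholder for the real hostname.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ClusterOverviewCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the cluster overview from the JobManager web frontend.
        URL overview = new URL("http://jobmanager-host:8081/overview");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(overview.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // The JSON should contain fields such as "slots-total" and "slots-available";
                // compare them before and after killing a TaskManager.
                System.out.println(line);
            }
        }
    }
}
```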

Can you provide JobManager and TaskManager logs and some more details on
the job you are running?


Nico

On 16/01/18 07:04, Data Engineer wrote:

> This question has been asked on StackOverflow:
> https://stackoverflow.com/questions/48262080/how-to-get-automatic-fail-over-working-in-flink
>
> I am using Apache Flink 1.4 on a cluster of 3 machines, out of which one
> is the JobManager and the other 2 host TaskManagers.
>
> I start flink in cluster mode and submit a flink job. I have configured
> 24 task slots in the flink config, and for the job I use 6 task slots.
>
> When I submit the job, I see 3 tasks are assigned to Worker machine 1
> and 3 are assigned to Worker machine 2. Now, when I kill the TaskManager
> on WorkerMachine 2, I see that the entire job fails.
>
> Is this the expected behaviour, or does it have automatic failover as in
> Spark.
>
> Do we need to use YARN/Mesos to achieve automatic failover?
>
> We tried the Restart Strategy, but when it restarts we get an exception
> saying that no task slots are available and then the job fails. We think
> that 24 slots is enough to take over. What could we be doing wrong here?
>
> Regards,
> James



Re: How to get automatic fail over working in Flink

James Isaac
Hi Nico,

Thank you for your reply.

I have configured each TaskManager with 24 available task slots. When both TaskManagers were running, I could see that a total of 8 task slots were being used.
I also see that 24 task slots are available after one TaskManager goes down. I don't see any exception regarding available task slots.

However, I get a java.net.ConnectException each time the JobManager tries to connect to the TaskManager that I have killed. It retries 3 times (the number of restart attempts I have set) and then the job fails.
I expect the JobManager to move the workload to the remaining machine on which a TaskManager is still running. Or does it expect both TaskManagers to be up by the time the job restarts?
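One way to narrow this down (a sketch, not something confirmed in this thread) is to ask the JobManager which TaskManagers it still considers registered: if the killed TaskManager still appears in the list when the restart attempts run out, the JobManager has not yet detected its loss. This assumes the web frontend exposes a /taskmanagers endpoint; "jobmanager-host" is a placeholder.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TaskManagerListCheck {
    public static void main(String[] args) throws Exception {
        // List the TaskManagers the JobManager currently knows about.
        URL taskManagers = new URL("http://jobmanager-host:8081/taskmanagers");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(taskManagers.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each entry describes one registered TaskManager; the killed one should
                // drop out of this list once the JobManager has noticed it is gone.
                System.out.println(line);
            }
        }
    }
}
```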

Regards,
James

On Tue, Jan 16, 2018 at 3:02 PM, Nico Kruber <[hidden email]> wrote:
Hi James,
In this scenario, with the restart strategy set, the job should restart
(without YARN/Mesos) as long as you have enough slots available.

Can you check with the web interface on http://<jobmanager>:8081/ that
enough slots are available after killing one TaskManager?

Can you provide JobManager and TaskManager logs and some more details on
the job you are running?


Nico




Attachment: flink-root-jobmanager-11-host-aa-bbb-cc-52.log (160K)

Re: How to get automatic fail over working in Flink

Fabian Hueske-2

2018-01-17 8:10 GMT+01:00 Data Engineer <[hidden email]>:
Hi Nico,

Thank you for your reply.

I have configured each TaskManager with 24 available task slots. When both TaskManagers were running, I could see that a total of 8 task slots were being used.
I also see that 24 task slots are available after one TaskManager goes down.  I don't see any exception regarding available task slots

However, I get a java.net.ConnectException each time the JobManager tries to connect to the TaskManager that I have killed. It retries 3 times(the number I have set) and then the job fails.
I expect the JobManager to move the workload to the remaining machine on which TaskManager is running. Or does it expect both TaskManagers to be up by the time it restarts?

Regards,
James
