(DEPRECATED) Apache Flink User Mailing List archive.

Flink on Kubernetes unable to Recover from failure

Geldenhuys, Morgan Karl

Flink on Kubernetes unable to Recover from failure

Community,

I am currently doing some fault tolerance testing for Flink (1.10) running on Kubernetes (1.18) and am encountering a problem: after a running job experiences a failure, the job fails completely.

A Flink session cluster has been created according to the documentation contained here: https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html. The job is then uploaded and deployed via the web interface and everything runs smoothly. The job has a parallelism of 24, with 3 additional worker nodes held in reserve for failover. Each worker is assigned 1 task slot (27 slots in total).
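The slot arithmetic in this setup can be written out as a minimal sketch. It assumes the job uses Flink's default slot sharing, so that it needs as many slots as its parallelism; the function names are hypothetical and only illustrate the numbers given above.

```python
def total_slots(task_managers: int, slots_per_tm: int = 1) -> int:
    """Slots offered by the task managers currently registered."""
    return task_managers * slots_per_tm


def spare_slots(parallelism: int, task_managers: int, slots_per_tm: int = 1) -> int:
    """Slots left over once the job is running (negative means a shortfall)."""
    return total_slots(task_managers, slots_per_tm) - parallelism


def can_schedule(parallelism: int, task_managers: int, slots_per_tm: int = 1) -> bool:
    """Assumption: with default slot sharing, the job needs `parallelism` slots."""
    return total_slots(task_managers, slots_per_tm) >= parallelism


if __name__ == "__main__":
    # Setup from the post: parallelism 24, 27 workers, 1 slot each.
    print(spare_slots(24, 27))   # reserve available for failover
    print(can_schedule(24, 26))  # one worker lost
    print(can_schedule(24, 0))   # no slots registered at all
```

Under these assumptions the job should tolerate the loss of up to three workers, so a restart that fails for lack of slots suggests that far more than one task manager has dropped out of the registered set.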

The next step is to inject a fault, for which I use the Pumba chaos testing tool (https://github.com/alexei-led/pumba) to pause a random worker process. The selection and pausing are done manually for the moment.

Looking at the error logs, Flink does detect the error after the timeout (the heartbeat timeout has been set to 20 seconds):

java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 768848f91ebdbccc8d518e910160414d timed out.

After the failure has been detected, the system resets to the latest saved checkpoint and restarts. The system catches up nicely and resumes normal processing... however, after about 3 minutes, the following error occurs:

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.45.128.1:6121'. This might indicate that the remote task manager was lost.

The job fails and is unable to restart because the number of task slots has been reduced to zero. Looking at the Kubernetes cluster, all containers are running...

Has anyone else run into this error? What am I missing? The same thing happens when the containers are deleted.

Regards,
M.

rmetzger0

Re: Flink on Kubernetes unable to Recover from failure

Hey Morgan,

Is it possible for you to provide us with the full logs of the JobManager and the affected TaskManager?

This might give us a hint why the number of task slots is zero.

Best,

Robert

On Tue, May 5, 2020 at 11:41 AM Morgan Geldenhuys <[hidden email]> wrote:

Yun Tang

Re: Flink on Kubernetes unable to Recover from failure

Hi Morgan

When you say "because the number of task slots has been reduced to zero", do you mean that the total number of task slots dropped to 0? And how many registered task managers could you see when this happened? (You can open the "Task Managers" tab to view the related information.)

All containers running does not mean they are all registered with the job manager. I think you could check the JM and TM logs to see whether the registration connection was lost.
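The same check can also be scripted against the JobManager's REST API instead of the web UI. The following is a minimal sketch, assuming the REST endpoint is reachable at http://localhost:8081 (for example through a port-forward; that address is an assumption, not something stated in this thread). It reads the cluster overview and compares the number of registered task managers with the number expected. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_overview(base_url: str) -> dict:
    """Fetch the cluster overview from the JobManager REST endpoint."""
    with urlopen(f"{base_url}/overview", timeout=10) as resp:
        return json.load(resp)


def registration_report(overview: dict, expected_task_managers: int) -> dict:
    """Compare registered task managers and slots with what is expected."""
    registered = overview["taskmanagers"]
    return {
        "registered": registered,
        "missing": expected_task_managers - registered,
        "slots_total": overview["slots-total"],
        "slots_available": overview["slots-available"],
    }


if __name__ == "__main__":
    # Assumed address; adjust to wherever the JobManager REST port is exposed.
    overview = fetch_overview("http://localhost:8081")
    # 27 is the number of workers described in the original post.
    print(registration_report(overview, expected_task_managers=27))
```

If the report shows containers missing from the registered set while Kubernetes still lists them as running, the JM and TM logs are the next place to look for a lost registration.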

Best