Taskmanager killed often after migrating to flink 1.12

Taskmanager killed often after migrating to flink 1.12

Sambaran
Hi there,

We have recently migrated from Flink 1.7 to Flink 1.12. Although the jobs are running fine, the task manager sometimes gets killed (quite frequently, 2-3 times a day).

Logs:
INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

While checking further logs we see that Flink is not able to discard old checkpoints:
org.apache.flink.runtime.checkpoint.CheckpointsCleaner       [] - Could not discard completed checkpoint 173.

We are not sure what the reason is here; has anyone faced this before?

Regards
Sambaran

Re: Taskmanager killed often after migrating to flink 1.12

Till Rohrmann
Hi Sambaran,

could you also share with us the cause why the checkpoints could not be discarded?

With Flink 1.10, we introduced a stricter memory model for the TaskManagers. That could be a reason why you see more TaskManagers being killed by the underlying resource management system. You could maybe check whether your resource management system logs that some containers/pods are exceeding their memory limitations. If this is the case, then you should give your Flink processes a bit more memory [1].
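For reference, a minimal flink-conf.yaml sketch of the kind of change suggested here; the key names are the standard Flink 1.10+ memory options, but the values are only illustrative and need to be sized against your actual container/pod limits:

    # Total memory of the TaskManager process, including JVM metaspace and overhead.
    # This is the figure to compare against the container/pod memory limit.
    taskmanager.memory.process.size: 4096m

    # Optionally reserve a larger safety margin for JVM/off-heap overhead
    # (the default fraction is 0.1).
    taskmanager.memory.jvm-overhead.fraction: 0.15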


Cheers,
Till


Re: Taskmanager killed often after migrating to flink 1.12

Sambaran
Hi Till,

Thank you for the response. We are currently running Flink with increased memory, and so far the task manager is working fine. We will check if there are any further issues and update you.

Regards
Sambaran


Re: Taskmanager killed often after migrating to flink 1.12

Till Rohrmann
Great, thanks for the update. 
