Taskmanager killed often after migrating to flink 1.12

Taskmanager killed often after migrating to flink 1.12

Sambaran
Hi there,

We have recently migrated from Flink 1.7 to Flink 1.12. Although the jobs are running fine, the task manager sometimes gets killed (quite frequently, 2-3 times a day).

Logs:
INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

While checking further logs we see that Flink is not able to discard old checkpoints:
org.apache.flink.runtime.checkpoint.CheckpointsCleaner       [] - Could not discard completed checkpoint 173.

We are not sure what the reason is here; has anyone faced this before?

Regards
Sambaran

Re: Taskmanager killed often after migrating to flink 1.12

Till Rohrmann
Hi Sambaran,

could you also share with us the cause why the checkpoints could not be discarded?

With Flink 1.10, we introduced a stricter memory model for the TaskManagers. That could be a reason why you see more TaskManagers being killed by the underlying resource management system. You could maybe check whether your resource management system logs that some containers/pods are exceeding their memory limitations. If this is the case, then you should give your Flink processes a bit more memory [1].
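For reference, a minimal flink-conf.yaml sketch of the kind of change suggested here; the key names are the standard Flink 1.10+ memory options, but the values are only illustrative and need to be sized against your actual container/pod limits:

    # Total memory of the TaskManager process, including JVM metaspace and overhead.
    # This is the figure to compare against the container/pod memory limit.
    taskmanager.memory.process.size: 4096m

    # Optionally reserve a larger safety margin for JVM/off-heap overhead
    # (the default fraction is 0.1).
    taskmanager.memory.jvm-overhead.fraction: 0.15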


Cheers,
Till


Re: Taskmanager killed often after migrating to flink 1.12

Sambaran
Hi Till,

Thank you for the response. We are currently running Flink with increased memory, and so far the task manager is working fine. We will check if there are any further issues and update you.

Regards
Sambaran


Re: Taskmanager killed often after migrating to flink 1.12

Till Rohrmann
Great, thanks for the update. 
