Job leak in attached mode (batch scenario)

Job leak in attached mode (batch scenario)

qi luo
Hi guys,

We run thousands of Flink batch jobs every day. The batch jobs are submitted in attached mode, so the client knows when a job has finished and can take further actions. To respond to user abort actions, we submit the jobs with "--shutdownOnAttachedExit" so that the Flink cluster is shut down when the client exits.

However, in some cases the Flink client exits abnormally (such as on an OOM), so the shutdown signal is never sent to the Flink cluster, causing a "job leak": the lingering Flink job continues to run and never ends, consuming a large amount of resources and possibly even producing unexpected results.

Does Flink have any mechanism to handle this scenario? (For comparison, Spark has client mode, where the driver runs on the client side, so the job exits when the client exits.) Any ideas would be much appreciated!

Thanks,
Qi
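
For readers hitting the same issue, the submission path described above can be sketched as below. The jar path, job arguments, and parallelism are placeholders of this sketch; the flag itself is the real Flink CLI option `-sae` / `--shutdownOnAttachedExit`.

```python
import subprocess

def build_submit_command(jar_path, *job_args, parallelism=1):
    """Build a `flink run` command line for attached-mode submission.

    --shutdownOnAttachedExit asks the cluster to shut down when this
    client exits -- which is exactly the signal that is lost when the
    client dies from an OOM before it can be delivered.
    """
    return [
        "flink", "run",
        "--shutdownOnAttachedExit",  # real Flink CLI flag, short form -sae
        "-p", str(parallelism),      # placeholder parallelism
        jar_path,
        *job_args,
    ]

def submit(jar_path, *job_args, parallelism=1):
    # Blocks until the job finishes (attached mode), then returns the exit code.
    cmd = build_submit_command(jar_path, *job_args, parallelism=parallelism)
    return subprocess.run(cmd).returncode
```

Note that if the client process is killed abruptly (OOM kill, SIGKILL), `subprocess.run` never returns and the flag never takes effect, which is the failure mode described above.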

Re: Job leak in attached mode (batch scenario)

Haibo Sun
Hi, Qi

As far as I know, there is no such mechanism at present. To achieve this, I think it would be necessary to add a REST-based heartbeat mechanism between the Dispatcher and the client. For now, perhaps you can add a monitoring service to clean up these residual Flink clusters.

Best,
Haibo
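
A minimal sketch of such a monitoring service is shown below, assuming the heartbeat bookkeeping (the `last_heartbeat` timestamps and the 120-second timeout) is supplied by your own client-side tooling; the two REST calls (`GET /jobs/overview` and `PATCH /jobs/<jobid>?mode=cancel`) are part of Flink's documented REST API.

```python
import json
import urllib.request

HEARTBEAT_TIMEOUT = 120.0  # seconds without a client heartbeat => leaked (assumed value)

def find_leaked_jobs(running_job_ids, last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return ids of running jobs whose submitting client has gone silent.

    `last_heartbeat` maps job id -> unix time of the client's last ping;
    a job with no recorded heartbeat at all is also treated as leaked.
    """
    return [
        job_id for job_id in running_job_ids
        if now - last_heartbeat.get(job_id, 0.0) > timeout
    ]

def list_running_jobs(rest_base):
    """List RUNNING jobs via Flink's REST API (GET /jobs/overview)."""
    with urllib.request.urlopen(f"{rest_base}/jobs/overview") as resp:
        overview = json.load(resp)
    return [j["jid"] for j in overview["jobs"] if j["state"] == "RUNNING"]

def cancel_job(rest_base, job_id):
    """Cancel one job via Flink's REST API (PATCH /jobs/<jobid>?mode=cancel)."""
    req = urllib.request.Request(
        f"{rest_base}/jobs/{job_id}?mode=cancel", method="PATCH")
    urllib.request.urlopen(req).close()
```

A cron job or loop would then call `list_running_jobs`, feed the result to `find_leaked_jobs`, and `cancel_job` each stale entry.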

At 2019-07-16 14:42:37, "qi luo" <[hidden email]> wrote:

Re: Job leak in attached mode (batch scenario)

qi luo
Thanks Haibo for the response!

Is there any community issue or plan to implement a heartbeat mechanism between the Dispatcher and the client? If not, should I create one?

Regards,
Qi
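
To make the proposal concrete, here is what the client side of such a heartbeat could look like. Note that the `/jobs/<jobid>/client-heartbeat` endpoint is purely hypothetical: Flink exposes no such endpoint today, and this only illustrates the idea.

```python
import threading
import urllib.request

def start_heartbeat(rest_base, job_id, interval=10.0):
    """Start a daemon thread pinging a (hypothetical) heartbeat endpoint.

    If the client process dies for any reason -- including an OOM kill --
    the pings simply stop, and the Dispatcher could cancel the job once
    the heartbeat times out. Returns an Event; set it to stop heartbeating.
    """
    stop = threading.Event()

    def beat():
        url = f"{rest_base}/jobs/{job_id}/client-heartbeat"  # hypothetical endpoint
        while not stop.wait(interval):
            try:
                urllib.request.urlopen(
                    urllib.request.Request(url, method="POST")).close()
            except OSError:
                pass  # transient network error: keep trying

    threading.Thread(target=beat, daemon=True).start()
    return stop
```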

On Jul 17, 2019, at 10:19 AM, Haibo Sun <[hidden email]> wrote:



Re: Re: Job leak in attached mode (batch scenario)

Haibo Sun

There doesn't appear to be a JIRA issue for this requirement yet. If you have a strong need for the feature, you can create one. You can also search issues.apache.org by keyword to confirm whether a relevant JIRA issue already exists.

Best,
Haibo

At 2019-07-18 10:31:22, "qi luo" <[hidden email]> wrote: