Stop job with savepoint during graceful shutdown on a k8s cluster

shravan
In our setup, the Job Manager and the Task Managers run as separate pods
within a K8S cluster. Since we do not use a job cluster, the job jars are not
part of the Job Manager docker image; the job is submitted from a separate
Flink client pod. Flink is configured with the RocksDB state backend. We build
the docker images ourselves because the base OS image needs to comply with
our organization's guidelines.

We are looking for a reliable way to stop the job with a savepoint during
graceful shutdown, so that there are no duplicates on restart.
The Job Manager pod traps the shutdown signal and stops all jobs with
savepoints. When the Flink client pod restarts, it starts the job from the
savepoint. But since the order in which the pods are shut down is not
predictable, we have the following questions:
1. Our understanding is that, when a job is stopped with a savepoint, all
Task Managers persist their state as part of the savepoint. If a Task Manager
receives a shutdown signal while the savepoint is being taken, does it
complete the savepoint before shutting down?
2. The Job Manager K8S service is configured as the remote Job Manager
address in the Task Managers. This service may not be available during the
savepoint. Will that affect the communication between the Task Managers and
the Job Manager during the savepoint?

Can you provide some pointers on the internals of savepoints in Flink?
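
For readers who want to picture the shutdown flow described above, here is a
minimal sketch, not the setup actually used in this thread. It assumes a Flink
version whose REST API exposes `POST /jobs/:jobid/stop` (stop with savepoint)
and `GET /jobs`. The service URL `FLINK_REST` and the directory
`SAVEPOINT_DIR` are hypothetical placeholders, and the helper names are made
up for illustration.

```python
import json
import signal
import urllib.request

# Hypothetical values, not taken from the thread.
FLINK_REST = "http://flink-jobmanager:8081"
SAVEPOINT_DIR = "s3://my-bucket/savepoints"


def build_stop_request(base_url, job_id, target_dir, drain=False):
    """Build the stop-with-savepoint REST request for one job."""
    url = f"{base_url}/jobs/{job_id}/stop"
    body = json.dumps({"targetDirectory": target_dir, "drain": drain})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def running_job_ids(base_url):
    """Return the ids of all jobs the Job Manager reports as RUNNING."""
    with urllib.request.urlopen(f"{base_url}/jobs", timeout=10) as resp:
        jobs = json.load(resp)["jobs"]
    return [job["id"] for job in jobs if job["status"] == "RUNNING"]


def stop_all_jobs(base_url, target_dir):
    """Trigger stop-with-savepoint for every running job.

    Returns a dict mapping job id to the trigger id of the asynchronous
    savepoint operation; the caller still has to poll for completion.
    """
    trigger_ids = {}
    for job_id in running_job_ids(base_url):
        request = build_stop_request(base_url, job_id, target_dir)
        with urllib.request.urlopen(request, timeout=10) as resp:
            trigger_ids[job_id] = json.load(resp)["request-id"]
    return trigger_ids


def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first and SIGKILL once the grace period ends.
    stop_all_jobs(FLINK_REST, SAVEPOINT_DIR)


def install_handler():
    """Register the SIGTERM handler; call this from the main thread."""
    signal.signal(signal.SIGTERM, on_sigterm)
```

The stop request only triggers the savepoint. The savepoint also has to
finish within the pod's `terminationGracePeriodSeconds` (30 seconds by default
in Kubernetes), so a larger grace period is probably needed for big state.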




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Stop job with savepoint during graceful shutdown on a k8s cluster

Vijay Bhaskar
Please find my answers inline.

Our understanding is that, when a job is stopped with a savepoint, all
Task Managers persist their state as part of the savepoint. If a Task Manager
receives a shutdown signal while the savepoint is being taken, does it
complete the savepoint before shutting down?
[Ans] Why would the Task Manager be shut down suddenly? Are you asking about
handling an unpredictable shutdown while a savepoint is being taken? In that
case you can also use retained checkpoints. If the current checkpoint or
savepoint fails because of the shutdown, you still have the previous
checkpoint and can restore from it. You then have two options to restore
from, the savepoint or the retained checkpoint, and one of them will always
be available (once at least one checkpoint has completed).
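
To make the retained-checkpoint fallback concrete, here is a small sketch of
how a restart script might find the newest completed checkpoint. It assumes
the checkpoint directory for the job is reachable as a local path (for S3 or
HDFS the listing would have to go through the matching client) and relies on
the usual layout in which each completed checkpoint lives in a `chk-<n>`
directory containing a `_metadata` file. The function name is made up for
illustration.

```python
import os
import re

_CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_retained_checkpoint(job_checkpoint_dir):
    """Return the path of the newest chk-<n> directory that has _metadata.

    Directories without a _metadata file are skipped, since they belong to
    checkpoints that never completed. Returns None if nothing usable exists.
    """
    best = None
    for name in os.listdir(job_checkpoint_dir):
        match = _CHK_DIR.match(name)
        if not match:
            continue
        path = os.path.join(job_checkpoint_dir, name)
        if not os.path.isfile(os.path.join(path, "_metadata")):
            continue
        number = int(match.group(1))  # compare numerically: chk-10 > chk-9
        if best is None or number > best[0]:
            best = (number, path)
    return None if best is None else best[1]
```

The returned path is what would be passed as the restore path when the client
pod resubmits the job, in the same way as a savepoint path.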

The Job Manager K8S service is configured as the remote Job Manager
address in the Task Managers. This service may not be available during the
savepoint. Will that affect the communication between the Task Managers and
the Job Manager during the savepoint?
[Ans] You could run Flink in HA mode, with more than one Job Manager, so that
one is always available to serve requests.




On Fri, Mar 13, 2020 at 2:40 PM shravan <[hidden email]> wrote:
[...]

Re: Stop job with savepoint during graceful shutdown on a k8s cluster

shravan
Our understanding is that, when a job is stopped with a savepoint, all
Task Managers persist their state as part of the savepoint. If a Task Manager
receives a shutdown signal while the savepoint is being taken, does it
complete the savepoint before shutting down?
[Ans] Why would the Task Manager be shut down suddenly? Are you asking about
handling an unpredictable shutdown while a savepoint is being taken? In that
case you can also use retained checkpoints. If the current checkpoint or
savepoint fails because of the shutdown, you still have the previous
checkpoint and can restore from it. You then have two options to restore
from, the savepoint or the retained checkpoint, and one of them will always
be available.
*[Followup Question]* When the processing service is shut down, say for
maintenance, it is a graceful shutdown, and we are looking for a way to avoid
duplicates because our service guarantees exactly-once message processing.
We already start the job from the latest checkpoint or savepoint, whichever
is more recent. When the job is started from the last good checkpoint, it
produces duplicates.

The Job Manager K8S service is configured as the remote Job Manager
address in the Task Managers. This service may not be available during the
savepoint. Will that affect the communication between the Task Managers and
the Job Manager during the savepoint?
[Ans] You could run Flink in HA mode, with more than one Job Manager, so that
one is always available to serve requests.
*[Followup Question]* As mentioned above, the processing service is shut down
for maintenance.





Re: Stop job with savepoint during graceful shutdown on a k8s cluster

Vijay Bhaskar
For point (1) above:
It is up to the user to choose a source and sink that provide exactly-once
semantics, as per the documentation. With a supported source and sink
combination, duplicates are avoided.
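
To illustrate why the choice of sink matters, here is a toy model in plain
Python. It is not Flink code, and all class and function names are made up.
A job processes records, takes a snapshot every few records, fails, and
restarts from the last snapshot. A sink that writes immediately sees the
replayed records twice, while a sink that only commits at snapshots does not.

```python
class PlainSink:
    """Writes every record to the output immediately."""

    def __init__(self):
        self.output = []

    def write(self, record):
        self.output.append(record)

    def on_checkpoint(self):
        pass

    def on_restart(self):
        pass


class TransactionalSink:
    """Buffers records and commits them only when a checkpoint completes."""

    def __init__(self):
        self.output = []
        self.pending = []

    def write(self, record):
        self.pending.append(record)

    def on_checkpoint(self):
        self.output.extend(self.pending)
        self.pending = []

    def on_restart(self):
        # Uncommitted records are discarded; the source will replay them.
        self.pending = []


def run_with_failure(records, checkpoint_every, fail_after, sink):
    """Process `fail_after` records, fail, then resume from the last checkpoint."""
    last_checkpoint = 0
    for pos in range(fail_after):
        sink.write(records[pos])
        if (pos + 1) % checkpoint_every == 0:
            last_checkpoint = pos + 1
            sink.on_checkpoint()
    sink.on_restart()
    for pos in range(last_checkpoint, len(records)):
        sink.write(records[pos])
    sink.on_checkpoint()  # final snapshot at the end of the input
    return sink.output


records = list(range(10))
plain = run_with_failure(records, 4, 6, PlainSink())
transactional = run_with_failure(records, 4, 6, TransactionalSink())
aligned = run_with_failure(records, 4, 8, PlainSink())
```

In this model `plain` contains records 4 and 5 twice and `transactional`
contains each record once. The `aligned` run stops exactly at a snapshot,
which is the situation a stop with savepoint aims for, and has no duplicates
even with the plain sink.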

For point (2):
If the communication between the Job Manager and a Task Manager breaks during
a savepoint or checkpoint operation, the checkpoint/savepoint will be
declined, so it will not complete and will not be available to restore from.
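
Because a savepoint can be declined, a shutdown script probably should not
assume that it succeeded. The sketch below polls the status of a triggered
savepoint and reports the outcome. It assumes the response shape of Flink's
`GET /jobs/:jobid/savepoints/:triggerid` endpoint, with `status.id` set to
`IN_PROGRESS` or `COMPLETED` and an `operation` object holding either
`location` or `failure-cause`. The function names and the injected
`fetch_status` callable are made up for illustration.

```python
import time


def savepoint_outcome(status_response):
    """Map one status response to (outcome, detail)."""
    if status_response["status"]["id"] == "IN_PROGRESS":
        return ("in_progress", None)
    operation = status_response.get("operation", {})
    if "failure-cause" in operation:
        # The operation finished, but the savepoint was declined or failed.
        return ("failed", operation["failure-cause"].get("class"))
    return ("completed", operation.get("location"))


def wait_for_savepoint(fetch_status, deadline_seconds, poll_seconds=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the savepoint finishes or the deadline passes.

    `fetch_status` is any callable returning the parsed JSON status, for
    example a small wrapper around an HTTP GET to the Job Manager.
    """
    give_up_at = clock() + deadline_seconds
    while clock() < give_up_at:
        outcome, detail = savepoint_outcome(fetch_status())
        if outcome != "in_progress":
            return outcome, detail
        sleep(poll_seconds)
    return ("timed_out", None)
```

On `failed` or `timed_out`, the restart would have to fall back to the last
retained checkpoint, with the duplicates that implies for sinks that are not
exactly-once.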

Regards
Bhaskar

On Sat, Mar 14, 2020 at 4:54 PM shravan <[hidden email]> wrote:
[...]