Restoring from Flink Savepoint in Kubernetes not working

Restoring from Flink Savepoint in Kubernetes not working

Claude Murad
Hello, 

I have Flink set up as an Application Cluster in Kubernetes, using Flink version 1.12. I created a savepoint using the curl command, and the status indicated it was completed. I then tried to relaunch the job from that savepoint using the following arguments, as described in the doc here: https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes

args: ["standalone-job", "--job-classname", "<class-name>", "--job-id", "<job-id>", "--fromSavepoint", "s3://<bucket>/<folder>", "--allowNonRestoredState"]

After the job launches, the offsets are not the same as they were when the savepoint was created. The job ID passed in also does not match the job ID of the launched job. I even put in an incorrect savepoint path to see what would happen: there were no errors in the logs, and the job still launched. It seems these arguments are not being evaluated at all. Any ideas about this?
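For reference, I triggered the savepoint roughly like this via Flink's REST API (the JobManager host, job ID, bucket/folder, and trigger ID below are placeholders, not the actual values):

```shell
# Trigger a savepoint (asynchronous); the response contains a request-id.
curl -s -X POST "http://<jobmanager-host>:8081/jobs/<job-id>/savepoints" \
  -H "Content-Type: application/json" \
  -d '{"target-directory": "s3://<bucket>/<folder>", "cancel-job": false}'
# response: {"request-id":"<trigger-id>"}

# Poll the savepoint status using the returned trigger id.
curl -s "http://<jobmanager-host>:8081/jobs/<job-id>/savepoints/<trigger-id>"
# response: {"status":{"id":"COMPLETED"},"operation":{"location":"s3://..."}}
```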


Thanks  


Re: Restoring from Flink Savepoint in Kubernetes not working

Matthias
Hi Claude,
thanks for reaching out to the Flink community. Could you provide the Flink logs for this run to get a better understanding of what's going on? Additionally, what exact Flink 1.12 version are you using? Did you also verify that the snapshot was created by checking the actual folder?

Best,
Matthias

On Wed, Mar 31, 2021 at 4:56 AM Claude M <[hidden email]> wrote:

Re: Restoring from Flink Savepoint in Kubernetes not working

Claude Murad
Thanks for your reply.  I'm using the flink docker image flink:1.12.2-scala_2.11-java8.  Yes, the folder was created in S3.  I took a look at the UI and it showed the following:

Latest Restore ID: 49
Restore Time: 2021-03-31 09:37:43
Type: Checkpoint
Path: s3://<bucket>/<folder>/fcc82deebb4565f31a7f63989939c463/chk-49

However, this is different from the savepoint path I specified.  I specified the following:

s3://<bucket>/<folder>/savepoint2/savepoint-9fe457-504c312ffabe

Is there anything specific you're looking for in the logs? I did not find any exceptions, and there is a lot of sensitive information I would have to redact.

Also, this morning I tried creating another savepoint. It first showed as In Progress, but when I later checked the status, I saw the attached exception.

In the UI, I see the following:

Latest Failed Checkpoint ID: 50
Failure Time: 2021-03-31 09:34:43
Cause: Asynchronous task checkpoint failed.

What does this failure mean?  


On Wed, Mar 31, 2021 at 9:22 AM Matthias Pohl <[hidden email]> wrote:

Attachment: SavePointError.txt (9K)

Re: Restoring from Flink Savepoint in Kubernetes not working

Matthias
The logs would have helped me better understand what you were doing.

The stack trace you shared indicates one of two things: either you asked for the status of a savepoint operation that had already completed and was therefore removed from the operations cache, or you used a job ID/request ID pair that was not associated with any savepoint operation.
Operation results are only cached for 300 seconds before being evicted. You can verify in the logs [1] that the specific operation expired and was removed from the cache; look for a line like: "Evicted result with trigger id {} because its TTL of {}s has expired."

But you should also be able to verify the completion of the savepoint in the logs.
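If redacting the full logs is a concern, grepping the JobManager log for the savepoint and restore lines should be enough. A sketch (the pod name is a placeholder):

```shell
# Look for savepoint completion and the restore path the job actually used.
kubectl logs <jobmanager-pod-name> | grep -iE "savepoint|restor"
```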


On Wed, Mar 31, 2021 at 4:46 PM Claude M <[hidden email]> wrote: