Restoring from Flink Savepoint in Kubernetes not working

Restoring from Flink Savepoint in Kubernetes not working

Claude Murad
Hello, 

I have Flink set up as an Application Cluster in Kubernetes, using Flink version 1.12. I created a savepoint using the curl command, and the status indicated it was completed. I then tried to relaunch the job from that savepoint using the following arguments, as described in the doc here: https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes

args: ["standalone-job", "--job-classname", "<class-name>", "--job-id", "<job-id>", "--fromSavepoint", "s3://<bucket>/<folder>", "--allowNonRestoredState"]

After the job launches, the offsets are not the same as they were when the savepoint was created. The job ID passed in also does not match the job ID of the launched job. I even put in an incorrect savepoint path to see what would happen: there were no errors in the logs, and the job still launched. It seems these arguments are not being evaluated at all. Any ideas about this?
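For reference, I triggered the savepoint roughly like this via Flink's REST API (the JobManager host, job ID, bucket/folder, and trigger ID below are placeholders, not the actual values):

```shell
# Trigger a savepoint (asynchronous); the response contains a request-id.
curl -s -X POST "http://<jobmanager-host>:8081/jobs/<job-id>/savepoints" \
  -H "Content-Type: application/json" \
  -d '{"target-directory": "s3://<bucket>/<folder>", "cancel-job": false}'
# response: {"request-id":"<trigger-id>"}

# Poll the savepoint status using the returned trigger id.
curl -s "http://<jobmanager-host>:8081/jobs/<job-id>/savepoints/<trigger-id>"
# response: {"status":{"id":"COMPLETED"},"operation":{"location":"s3://..."}}
```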


Thanks  


Re: Restoring from Flink Savepoint in Kubernetes not working

Matthias
Hi Claude,
thanks for reaching out to the Flink community. Could you provide the Flink logs for this run to get a better understanding of what's going on? Additionally, what exact Flink 1.12 version are you using? Did you also verify that the snapshot was created by checking the actual folder?

Best,
Matthias

On Wed, Mar 31, 2021 at 4:56 AM Claude M <[hidden email]> wrote:

Re: Restoring from Flink Savepoint in Kubernetes not working

Claude Murad
Thanks for your reply.  I'm using the flink docker image flink:1.12.2-scala_2.11-java8.  Yes, the folder was created in S3.  I took a look at the UI and it showed the following:

Latest Restore ID: 49
Restore Time: 2021-03-31 09:37:43
Type: Checkpoint
Path: s3://<bucket>/<folder>/fcc82deebb4565f31a7f63989939c463/chk-49

However, this is different from the savepoint path I specified.  I specified the following:

s3://<bucket>/<folder>/savepoint2/savepoint-9fe457-504c312ffabe

Is there anything specific you're looking for in the logs? I did not find any exceptions, and there is a lot of sensitive information I would have to redact.

Also, this morning I tried creating another savepoint. It first showed as In Progress, but when I later checked the status, I saw the attached exception.

In the UI, I see the following:

Latest Failed Checkpoint ID: 50
Failure Time: 2021-03-31 09:34:43
Cause: Asynchronous task checkpoint failed.

What does this failure mean?  


On Wed, Mar 31, 2021 at 9:22 AM Matthias Pohl <[hidden email]> wrote:

Attachment: SavePointError.txt (9K)

Re: Restoring from Flink Savepoint in Kubernetes not working

Matthias
The logs would have helped me better understand what you were doing.

The stack trace you shared indicates one of two things: either you asked for the status of a savepoint operation that had already completed and was therefore removed from the operations cache, or you used a job ID/request ID pair that was not associated with any savepoint operation.
Operation results are only cached for 300 seconds before being evicted. You can verify in the logs [1] that the specific operation expired and was removed from the cache; look for a line like: "Evicted result with trigger id {} because its TTL of {}s has expired."

But you should also be able to verify the completion of the savepoint in the logs.
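If redacting the full logs is a concern, grepping the JobManager log for the savepoint and restore lines should be enough. A sketch (the pod name is a placeholder):

```shell
# Look for savepoint completion and the restore path the job actually used.
kubectl logs <jobmanager-pod-name> | grep -iE "savepoint|restor"
```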


On Wed, Mar 31, 2021 at 4:46 PM Claude M <[hidden email]> wrote: