standalone flink savepoint restoration

standalone flink savepoint restoration

Matt Anger
Hello everyone,
I am running a Flink job in k8s as a standalone HA job. I recently updated my job with some additional sinks, which I guess has made the checkpoints incompatible with the newer version; Flink now crashes on startup with the following:
 Caused by: java.lang.IllegalStateException: There is no operator for the state c9b81dfc309f1368ac7efb5864e7b693

So I rolled back the deployment, logged into the pod, triggered a savepoint, and then modified my args to add

--allowNonRestoredState 
and
-s <savepoint-dir>

but it doesn't look like the standalone cluster is respecting those arguments. I've searched around but haven't found a solution. The Docker image I have runs docker-entrypoint.sh, and the full arg list is below, copy-pasted out of my k8s YAML file:

        - job-cluster
        - -Djobmanager.rpc.address=$(SERVICE_NAME)
        - -Djobmanager.rpc.port=6123
        - -Dresourcemanager.rpc.port=6123
        - -Dparallelism.default=$(NUM_WORKERS)
        - -Dblob.server.port=6124
        - -Dqueryable-state.server.ports=6125
        - -Ds3.access-key=$(AWS_ACCESS_KEY_ID)
        - -Ds3.secret-key=$(AWS_SECRET_ACCESS_KEY)
        - -Dhigh-availability=zookeeper
        - -Dhigh-availability.jobmanager.port=50010
        - -Dhigh-availability.storageDir=$(S3_HA)
        - -Dhigh-availability.zookeeper.quorum=$(ZK_QUORUM)
        - -Dstate.backend=filesystem
        - -Dstate.checkpoints.dir=$(S3_CHECKPOINT)
        - -Dstate.savepoints.dir=$(S3_SAVEPOINT)
        - --allowNonRestoredState
        - -s $(S3_SAVEPOINT)

I originally didn't have the last 2 args; I added them based on various emails I saw on this list and other Google search results, to no avail.
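One thing I'm now second-guessing (just a guess on my part, not confirmed): in a k8s args list, `- -s $(S3_SAVEPOINT)` is handed to the entrypoint as a single argument containing a space, so the flag and its value may never get split. Separating them into two list items, and pointing -s at one concrete savepoint path instead of the savepoints base directory, would look something like:

```yaml
# Hypothetical fragment: S3_SAVEPOINT_PATH would hold the full path of one
# specific savepoint (e.g. .../savepoints/savepoint-abc123), not the base dir
        - --allowNonRestoredState
        - -s
        - $(S3_SAVEPOINT_PATH)
```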

Thanks
-Matt

Re: standalone flink savepoint restoration

Yun Tang
Hi Matt

Have you configured `high-availability.cluster-id`? If not, a standalone Flink job will first try to recover from the high-availability checkpoint store under the default cluster-id '/default'. If a checkpoint exists there, Flink will always restore from that checkpoint with 'allowNonRestoredState' effectively disabled [1] (it always passes 'false'). Please consider setting `high-availability.cluster-id` to a different value so that you can resume the job while dropping some operators.
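Concretely, that would mean adding something like the following to the job-cluster arguments (the value 'my-job-v2' is only an example name; pick a new id for each incompatible job version):

```yaml
# Example only: a fresh cluster-id keeps Flink from recovering the old
# '/default' HA checkpoint store and ignoring the -s savepoint argument
        - -Dhigh-availability.cluster-id=my-job-v2
```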



Best
Yun Tang

From: Matt Anger <[hidden email]>
Sent: Thursday, October 17, 2019 5:46
To: [hidden email] <[hidden email]>
Subject: standalone flink savepoint restoration
 

Re: standalone flink savepoint restoration

Congxian Qiu
Hi
Do you specify the operator ID for all of your operators? [1][2] I'm asking because, judging from the exception you gave, if you only added new operators and all of the old operators have explicit operator IDs, such an exception should not be thrown.
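For example, assigning stable IDs with `uid(...)` looks like this (a minimal sketch only; the operator names and the job itself are made-up examples, not your job):

```java
// Sketch: stable operator IDs let a savepoint's state be matched back to
// operators even after the job graph changes. Names here are illustrative.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .uid("uppercase-map")   // stable ID for this operator's state
           .print()
           .uid("print-sink");     // newly added sinks get their own uid too

        env.execute("uid example");
    }
}
```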


Yun Tang <[hidden email]> wrote on Thu, Oct 17, 2019 at 12:31 PM: