(DEPRECATED) Apache Flink User Mailing List archive.

EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun

EOFException on attempt to scale up job with RocksDB state backend

Hello,

I was trying to scale a job up: I took a savepoint, changed the parallelism setting from 6 to 8, and started the job from the savepoint:

switched from RUNNING to FAILED on 10.204.2.98:6122-2946e1 @ gsp-tm-0.gsp-headless.gsp.svc.cluster.local (dataPort=41409).
java.lang.Exception: Exception while creating StreamOperatorStateContext.
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:254)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:427)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:543)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:533)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_8131c39062c4256ee506e2382c4a7bfd_(3/8) from any of the 1 provided restore options.
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
    ... 9 common frames omitted
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:362)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:587)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:93)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
    ... 11 common frames omitted
Caused by: java.io.EOFException: null
    at java.io.DataInputStream.readShort(DataInputStream.java:315)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKVStateData(RocksDBFullRestoreOperation.java:230)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKeyGroupsInStateHandle(RocksDBFullRestoreOperation.java:163)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restore(RocksDBFullRestoreOperation.java:147)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:285)
    ... 16 common frames omitted

Thanks,
Alexey

Tzu-Li (Gordon) Tai

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

The savepoint was taken with 1.12.1. I've tried to scale up using both the same version and 1.12.2.

From: Tzu-Li (Gordon) Tai <[hidden email]>
Sent: Monday, March 15, 2021 12:06 AM
To: [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Can you rescale the job within a single version, i.e. from 1.12.1 to 1.12.1?

Best

Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 4:46
To: Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

The savepoint was taken with 1.12.1. I've tried to scale up using both the same version and 1.12.2.

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

No; I believe the original exception came from restoring a 1.12.1 savepoint on 1.12.1.

Thanks,

Alexey

From: Yun Tang <[hidden email]>
Sent: Monday, March 15, 2021 8:07:07 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi,

Can you rescale the job within a single version, i.e. from 1.12.1 to 1.12.1?

Best

Yun Tang

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Also, restoring from the same savepoint without changing the parallelism works fine.
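
Editorial sketch of why a rescaled restore exercises a different read pattern than an unchanged one. It mirrors, as far as I recall, the arithmetic Flink uses to assign key groups to subtasks (`KeyGroupRangeAssignment`); the class name `KeyGroupRanges` is hypothetical, and the max parallelism of 128 is an assumption, since the thread never states the job's actual value.

```java
final class KeyGroupRanges {
    /**
     * Inclusive key-group range [start, end] owned by one subtask.
     * Assumption: this matches the formula in Flink's KeyGroupRangeAssignment.
     */
    static int[] rangeFor(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumption: not stated anywhere in the thread
        for (int parallelism : new int[] {6, 8}) {
            StringBuilder line = new StringBuilder("parallelism " + parallelism + ":");
            for (int i = 0; i < parallelism; i++) {
                int[] r = rangeFor(maxParallelism, parallelism, i);
                line.append(" [").append(r[0]).append(", ").append(r[1]).append("]");
            }
            System.out.println(line);
        }
    }
}
```

Under these assumptions, new subtask 3/8 (index 2) owns key groups 32 to 47, which were previously split between old subtasks 2/6 (22 to 42) and 3/6 (43 to 63). A rescaled restore therefore has to read sub-ranges out of more than one savepoint file, whereas a restore with unchanged parallelism reads each file for exactly the range it was written for.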

From: Alexey Trenikhun <[hidden email]>
Sent: Monday, March 15, 2021 9:51 PM
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

No; I believe the original exception came from restoring a 1.12.1 savepoint on 1.12.1.

Thanks,

Alexey

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Judging by the line numbers in the stack trace, I believe your exception message was printed by Flink 1.12.2, not Flink 1.12.1.

Could you share the exception message from Flink 1.12.1 when rescaling? It would also help if you could share more logs from restoring and rescaling; I want to see the details of the key-group handle [1].

[1] https://github.com/apache/flink/blob/dc404e2538fdfbc98b9c565951f30f922bf7cedd/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBFullRestoreOperation.java#L153

Best

From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 15:10
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Also, restoring from the same savepoint without changing the parallelism works fine.

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

TM log is attached.

Thanks,

Alexey

From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 8:05 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Judging by the line numbers in the stack trace, I believe your exception message was printed by Flink 1.12.2, not Flink 1.12.1.

Could you share the exception message from Flink 1.12.1 when rescaling? It would also help if you could share more logs from restoring and rescaling; I want to see the details of the key-group handle [1].

Best

scale-up-taskmanager_10.204.2.75_6122-58e766_log.zip (1M) Download Attachment

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

In reply to this post by Yun Tang

Hi Yun,

I'm attaching a shorter version of the log; it looks like the full version didn't come through.

Thanks,

Alexey

short-scale-up-taskmanager_10.204.2.75_6122-58e766_log (1M) Download Attachment

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I wrote in my previous message, so that I can compare?

Best

Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I'm attaching a shorter version of the log; it looks like the full version didn't come through.

Thanks,

Alexey

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Attached.

From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I wrote in my previous message, so that I can compare?

Best

Yun Tang

no-scale-taskmanager_10.204.2.75_6122-3cf8aa_log (1M) Download Attachment

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your quick response. I have checked the two logs and still cannot understand why this happens.

Take "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" for example: the key-group range offsets were intersected correctly during the rescale for task "Intake voice calls (6/7)". The only thing I can still doubt is whether Azure Blob Storage behaved as expected when seeking to the offset [1].
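
Editorial sketch of the kind of local check described here: seek to an offset in a copy of the state file and confirm that at least a two-byte short can be read there, since the failing frame in the stack trace is `DataInputStream.readShort`. It uses only the JDK. The class `OffsetSeekCheck`, the helper `tempFileOfSize`, and the offsets are hypothetical; in a real check the offsets would come from the key-group offsets recorded in the savepoint's `_metadata`.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OffsetSeekCheck {
    /** Returns true if a two-byte short can be read at the given offset of the file. */
    static boolean readableAt(Path file, long offset) {
        if (offset < 0) {
            return false; // RandomAccessFile.seek rejects negative offsets
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset); // seeking past the end is allowed; the read then fails
            raf.readShort();
            return true;
        } catch (EOFException e) {
            return false; // fewer than two bytes remain at this offset
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical helper: creates a zero-filled temp file standing in for a state file. */
    static Path tempFileOfSize(int size) {
        try {
            Path file = Files.createTempFile("state-file", ".bin");
            Files.write(file, new byte[size]);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = tempFileOfSize(10);
        long[] offsets = {0, 4, 9, 32}; // hypothetical key-group offsets
        for (long offset : offsets) {
            System.out.println(offset + " -> " + readableAt(file, offset));
        }
        file.toFile().delete();
    }
}
```

An offset that passes this check locally but fails on the remote store would point at the storage layer's seek behaviour rather than at the recorded offsets.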

Have you ever enabled snappy compression [2] [3] for savepoints?

Could you also share the file "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" so that I can seek in it locally and check whether it works as expected?

Moreover, could you also share the savepoint metadata "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"?

[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression

[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression

Best

Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Attached.

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:

https://drive.google.com/drive/folders/1J3nwvQupLBT5ZaN_qEmc2y_-MgFz0cLb?usp=sharing

Compression was never enabled. (The docs say that RocksDB's incremental checkpoints always use snappy compression; I'm not sure whether that affects savepoints.)

Thanks,

Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 12:33 AM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Thanks for your quick response. I have checked the two logs and still cannot understand why this happens.

Have you ever enabled snappy compression [2] [3] for savepoints?

Could you also share the file "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" so that I can seek in it locally and check whether it works as expected?

Moreover, could you also share the savepoint metadata "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"?

[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression

Best

Yun Tang

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I tried to load your _metadata as a checkpoint via Checkpoints#loadCheckpointMetadata [1], but found that this file is actually not savepoint metadata. Are you sure you uploaded the correct files?

Moreover, I noticed that both 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata are 128 MB in size, which is much larger than they should be. Is this expected on Azure Blob Storage, or did you upload the wrong files?

[1] https://github.com/apache/flink/blob/956c0716fdbf20bf53305fe4d023fa2bea412595/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L99
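
Editorial sketch of a quick header check that can be run on the first bytes of a suspected `_metadata` file without Flink on the classpath. It assumes, from my reading of the linked `Checkpoints` class, that the file starts with a four-byte magic number followed by a four-byte version; the constant below is my recollection of `HEADER_MAGIC_NUMBER` and should be verified against that source. The class name `SavepointHeaderCheck` and the sample bytes are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class SavepointHeaderCheck {
    // Assumption: value of Checkpoints.HEADER_MAGIC_NUMBER; verify against the linked source.
    static final int ASSUMED_MAGIC = 0x4960672d;

    /** Returns the version field if the header starts with the assumed magic number, else -1. */
    static int versionOrMinusOne(byte[] header) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            if (in.readInt() != ASSUMED_MAGIC) {
                return -1; // not recognisable as checkpoint metadata
            }
            return in.readInt();
        } catch (IOException e) {
            return -1; // header too short (EOFException) or otherwise unreadable
        }
    }

    public static void main(String[] args) {
        byte[] plausible = {0x49, 0x60, 0x67, 0x2d, 0, 0, 0, 3}; // made-up version value
        byte[] zeroPadded = new byte[8];                          // e.g. an all-zero page
        System.out.println(versionOrMinusOne(plausible));
        System.out.println(versionOrMinusOne(zeroPadded));
    }
}
```

To check a real file, pass its first eight bytes to `versionOrMinusOne`; a result of -1 would be consistent with the file not being savepoint metadata in the form the loader expects.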

Best

Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 0:45
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:

https://drive.google.com/drive/folders/1J3nwvQupLBT5ZaN_qEmc2y_-MgFz0cLb?usp=sharing

Compression was never enabled. (The docs say that RocksDB's incremental checkpoints always use snappy compression; I'm not sure whether that affects savepoints.)

Thanks,

Alexey

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

The Azure web UI shows the size of every file created by Flink as a multiple of 128 MiB (128, 256, 640); see the attached screenshot. As I understand it, this is because Flink creates them as page blobs. In the same storage, another application creates files as block blobs, and their sizes are not rounded to 128 MiB.

Thanks,

Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 8:38 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I tried to load your _metadata as a checkpoint via Checkpoints#loadCheckpointMetadata [1], but found that this file is actually not savepoint metadata. Are you sure you uploaded the correct files?

[1] https://github.com/apache/flink/blob/956c0716fdbf20bf53305fe4d023fa2bea412595/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L99

Best

Yun Tang

Screen Shot 2021-03-17 at 8.47.36 PM.png (570K) Download Attachment

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" file you provided locally.

Currently, I strongly suspect that this strange rescaling behavior is related to your underlying storage. Could you try using block blobs instead of page blobs [1] to see whether the behavior persists?

[1] https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Block_Blob_with_Compaction_Support_and_Configuration

Best

Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 12:00
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

Thanks,

Alexey

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

How does the underlying storage explain the fact that I can restore from the savepoint without rescaling? Does Flink write a file once or many times? If many times, there could potentially be a problem with the limit of 50,000 blocks per blob, am I right? Should I try block blobs with compaction, as described in [1], or without compaction?

Thanks,

Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 9:31 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" file you provided locally.

[1] https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Block_Blob_with_Compaction_Support_and_Configuration

Best

Yun Tang

Yun Tang

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Flink writes each checkpoint file only once. Could you try writing the checkpoint files as block blobs and see whether the problem still exists?

Best

Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 13:54
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

Thanks,

Alexey

Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" file you provided locally.

[1] https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Block_Blob_with_Compaction_Support_and_Configuration

Best

Yun Tang

Hi Yun,

Thanks,

Alexey

Hi Alexey,

I tried to load your _metadata as a checkpoint via Checkpoints#loadCheckpointMetadata [1], but found that this file is actually not savepoint metadata. Did you upload the correct files?

[1] https://github.com/apache/flink/blob/956c0716fdbf20bf53305fe4d023fa2bea412595/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L99

Best

Yun Tang

Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:

https://drive.google.com/drive/folders/1J3nwvQupLBT5ZaN_qEmc2y_-MgFz0cLb?usp=sharing

Compression was never enabled (the docs say that RocksDB's incremental checkpoints always use snappy compression; I'm not sure whether that has any effect on savepoints).

Thanks,

Alexey

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I've changed the configuration to use block blobs. However, due to another issue [1], I can't take a savepoint. I hope the job will eventually be able to process its backlog; then I will take a savepoint, re-test, and let you know.

Thanks,

Alexey

[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-fail-due-to-timeout-td42125.html#a42248

From: Yun Tang <[hidden email]>
Sent: Thursday, March 18, 2021 5:08 AM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Alexey,

Flink writes each checkpoint file only once. Could you try writing the checkpoint files as block blobs and see whether the problem still exists?

Best

Yun Tang

Alexey Trenikhun

Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

I was finally able to try rescaling with block blobs configured: the job rescaled from 6 to 8 without a problem. So it looks like there is indeed a problem with page blobs.

Thank you for help,

Alexey

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 11:31 PM
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend

Hi Yun,

Thanks,

Alexey
