EOFException on attempt to scale up job with RocksDB state backend

EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hello,
I was trying to scale a job up: I took a savepoint, changed the parallelism setting from 6 to 8, and started the job from the savepoint:

switched from RUNNING to FAILED on 10.204.2.98:6122-2946e1 @ gsp-tm-0.gsp-headless.gsp.svc.cluster.local (dataPort=41409).
java.lang.Exception: Exception while creating StreamOperatorStateContext.
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:254)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:427)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:543)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:533)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_8131c39062c4256ee506e2382c4a7bfd_(3/8) from any of the 1 provided restore options.
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
    ... 9 common frames omitted
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:362)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:587)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:93)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
    ... 11 common frames omitted
Caused by: java.io.EOFException: null
    at java.io.DataInputStream.readShort(DataInputStream.java:315)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKVStateData(RocksDBFullRestoreOperation.java:230)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKeyGroupsInStateHandle(RocksDBFullRestoreOperation.java:163)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restore(RocksDBFullRestoreOperation.java:147)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:285)
    ... 16 common frames omitted


Thanks,
Alexey

Re: EOFException on attempt to scale up job with RocksDB state backend

Tzu-Li (Gordon) Tai
Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
The savepoint was taken with 1.12.1; I've tried to scale up using the same version and 1.12.2.


From: Tzu-Li (Gordon) Tai <[hidden email]>
Sent: Monday, March 15, 2021 12:06 AM
To: [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi,

Can you rescale the job within the same version, i.e. from 1.12.1 to 1.12.1?

Best
Yun Tang


From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 4:46
To: Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Savepoint was taken with 1.12.1, I've tried to scale up using same version and 1.12.2



Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
No, I believe the original exception was from 1.12.1 to 1.12.1.

Thanks,
Alexey


From: Yun Tang <[hidden email]>
Sent: Monday, March 15, 2021 8:07:07 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi,

Can you scale the job at the same version from 1.12.1 to 1.12.1?

Best
Yun Tang



Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Also, restoring from the same savepoint without changing the parallelism works fine.


From: Alexey Trenikhun <[hidden email]>
Sent: Monday, March 15, 2021 9:51 PM
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
No, I believe original exception was from 1.12.1 to 1.12.1

Thanks,
Alexey



Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi Alexey,

I believe your exception messages were printed by Flink 1.12.2, not Flink 1.12.1, judging by the line numbers in the stack trace.

Could you share the exception message from Flink 1.12.1 when rescaling? Moreover, I hope you can share more logs from both restoring and rescaling. I want to see the details of the key group handle [1].


Best 

From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 15:10
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Also restore from same savepoint without change in parallelism works fine.



Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hi Yun,
TM log is attached.

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 8:05 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I believe your exception messages are printed from Flink-1.12.2 not Flink-1.12.1 due to the line number of method calling.

Could you share exception message of Flink-1.12.1 when rescaling? Moreover, I hope you could share more logs during restoring and rescaling. I want to see details of key group handle [1]


Best 


Attachment: scale-up-taskmanager_10.204.2.75_6122-58e766_log.zip (1M)

Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
In reply to this post by Yun Tang
Hi Yun,
I'm attaching a shorter version of the log; it looks like the full version didn't come through.

Thanks,
Alexey


Attachment: short-scale-up-taskmanager_10.204.2.75_6122-58e766_log (1M)

Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I asked in my previous message, so that I can compare?

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
I'm attaching shorter version of log, looks like full version didn't come through

Thanks,
Alexey


Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Attached. 


From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your reply, could you also share logs during normal restoring just as I wrote in previous thread so that I could compare.

Best
Yun Tang


Attachment: no-scale-taskmanager_10.204.2.75_6122-3cf8aa_log (1M)

Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi Alexey,

Thanks for your quick response. I have checked the two logs and still cannot understand why this happens.

Take "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" as an example: the key group range offsets were intersected correctly during rescaling for task "Intake voice calls (6/7)". The only thing I still doubt is whether Azure Blob Storage worked as expected when seeking to the offset [1].
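
For readers following the key-group discussion: on rescale, each new subtask is assigned a contiguous range of key groups and reads only the parts of the old state files that intersect that range. Below is a minimal sketch of that arithmetic. It assumes the range formula used by Flink's KeyGroupRangeAssignment as I understand it, and a max parallelism of 128 (the thread does not state the job's actual value). The class and method names are illustrative and are not Flink API.

```java
class KeyGroupRanges {

    // Inclusive [start, end] key-group range owned by one subtask.
    // Assumption: same integer arithmetic as Flink's KeyGroupRangeAssignment.
    public static int[] rangeFor(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    // Intersection of two inclusive ranges, or null when they do not overlap.
    public static int[] intersect(int[] a, int[] b) {
        int start = Math.max(a[0], b[0]);
        int end = Math.min(a[1], b[1]);
        return start <= end ? new int[] {start, end} : null;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed; not stated in the thread
        int oldParallelism = 6;
        int newParallelism = 8;
        for (int n = 0; n < newParallelism; n++) {
            int[] target = rangeFor(maxParallelism, newParallelism, n);
            System.out.printf("new subtask %d/%d owns [%d, %d]%n",
                    n + 1, newParallelism, target[0], target[1]);
            // Each overlapping old range corresponds to one old state file to seek into.
            for (int o = 0; o < oldParallelism; o++) {
                int[] overlap = intersect(target, rangeFor(maxParallelism, oldParallelism, o));
                if (overlap != null) {
                    System.out.printf("  reads [%d, %d] from old subtask %d/%d%n",
                            overlap[0], overlap[1], o + 1, oldParallelism);
                }
            }
        }
    }
}
```

Under these assumptions, a new subtask can straddle two old ranges and therefore has to seek into two different state files, which is why a correct seek in each file matters for the restore.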

Have you ever enabled snappy compression [2] [3] for savepoints?
Could you also share the file "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" so that I can seek locally and check whether it works as expected?
Moreover, could you also share the savepoint metadata "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"?
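
A local seek check of this kind could look roughly like the sketch below: jump to a byte offset in a downloaded copy of the state file and read one short, which is the same DataInputStream.readShort call that fails in the stack trace. This is only a sketch. SeekProbe and readShortAt are made-up names, the demo file and offsets are invented, and the real restore path reads through Flink's file system abstraction rather than a local FileInputStream.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SeekProbe {

    // Positions the stream at `offset` and reads one big-endian short.
    // Throws EOFException when fewer than two bytes remain at that offset.
    public static short readShortAt(Path file, long offset) throws IOException {
        try (FileInputStream raw = new FileInputStream(file.toFile());
             DataInputStream in = new DataInputStream(raw)) {
            // Moving the channel position also moves the stream's read position.
            raw.getChannel().position(offset);
            return in.readShort();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical four-byte file standing in for a downloaded state file.
        Path file = Files.createTempFile("seek-probe", ".bin");
        Files.write(file, new byte[] {0x00, 0x2A, 0x01, 0x02});
        System.out.println("short at offset 0: " + readShortAt(file, 0));
        try {
            readShortAt(file, 3); // only one byte left at this offset
        } catch (EOFException e) {
            System.out.println("EOFException at offset 3, as in the stack trace");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

If the offsets recorded in the savepoint metadata point past the real end of the file, or the storage layer lands on the wrong position after a seek, a read like this is where the EOFException would surface.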


[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression
[3] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Attached. 



Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:

Compression was never enabled (the docs say that RocksDB's incremental checkpoints always use snappy compression; I'm not sure whether that has any effect on savepoints).

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 12:33 AM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your quick response. I have checked two different logs and still cannot understand why this could happen.

Take "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" for example, the key group range offset has been intersected correctly during rescale for task "Intake voice calls (6/7)". The only place I could doubt is that azure blob storage did work as expected during seek offset [1].

Have you ever enabled snappy compression [2] [3] for savepoints?
Could you also share the file "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0 " so that I could seek locally to see whether work as expected.
Moreover, could you also share savepoint meta data ""wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/_metadata" ?


[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression
[3] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Attached. 


From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your reply, could you also share logs during normal restoring just as I wrote in previous thread so that I could compare.

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
I'm attaching shorter version of log, looks like full version didn't come through

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 8:05 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I believe your exception messages are printed from Flink-1.12.2 not Flink-1.12.1 due to the line number of method calling.

Could you share exception message of Flink-1.12.1 when rescaling? Moreover, I hope you could share more logs during restoring and rescaling. I want to see details of key group handle [1]


Best 

From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 15:10
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Also restore from same savepoint without change in parallelism works fine.


From: Alexey Trenikhun <[hidden email]>
Sent: Monday, March 15, 2021 9:51 PM
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
No, I believe original exception was from 1.12.1 to 1.12.1

Thanks,
Alexey


From: Yun Tang <[hidden email]>
Sent: Monday, March 15, 2021 8:07:07 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi,

Can you scale the job at the same version from 1.12.1 to 1.12.1?

Best
Yun Tang


From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 4:46
To: Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Savepoint was taken with 1.12.1, I've tried to scale up using same version and 1.12.2


From: Tzu-Li (Gordon) Tai <[hidden email]>
Sent: Monday, March 15, 2021 12:06 AM
To: [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi Alexey,

I tried to load your _metadata as a checkpoint via Checkpoints#loadCheckpointMetadata [1], but found that this file is not actually savepoint metadata. Did you upload the correct files?
Moreover, I noticed that both 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata are 128 MB in size, which is much larger than they should be. Is this expected on Azure Blob Storage, or did you upload the wrong files?


Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 0:45
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:

Compression was never enabled (the docs say that RocksDB's incremental checkpoints always use snappy compression; I'm not sure whether that has any effect on savepoints).

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 12:33 AM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your quick response. I have checked the two different logs and still cannot understand why this happens.

Take "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" for example: the key group range offsets were intersected correctly during rescaling for task "Intake voice calls (6/7)". The only thing I can still question is whether Azure Blob Storage worked as expected when seeking to the offset [1].

Have you ever enabled snappy compression [2] [3] for savepoints?
Could you also share the file "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" so that I can seek in it locally and check whether that works as expected?
Moreover, could you also share the savepoint metadata "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"?


[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression
[3] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression

Best
Yun Tang
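
For readers following the thread, the key group range intersection mentioned above can be sketched in plain Java. This is a minimal illustration, not Flink code: the arithmetic is modeled on Flink's KeyGroupRangeAssignment as I understand it, the class and method names are hypothetical, and the max parallelism of 128 is an assumption (the job's real value is not stated in the thread). It uses the 6 to 8 rescale from the original report.

```java
/**
 * Standalone sketch of how key groups map to subtasks before and after a rescale.
 * It does not call Flink; names and the max parallelism value are assumptions.
 */
final class KeyGroupRangeSketch {

    /** Inclusive [start, end] key group range owned by one subtask (0-based index). */
    static int[] rangeFor(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    /** Intersection of two inclusive ranges, or null when they do not overlap. */
    static int[] intersect(int[] a, int[] b) {
        int start = Math.max(a[0], b[0]);
        int end = Math.min(a[1], b[1]);
        return start <= end ? new int[] {start, end} : null;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumption, not taken from the thread
        int oldParallelism = 6;
        int newParallelism = 8;

        for (int n = 0; n < newParallelism; n++) {
            int[] target = rangeFor(maxParallelism, newParallelism, n);
            StringBuilder line = new StringBuilder(
                    "new subtask " + n + " owns [" + target[0] + ", " + target[1] + "], reads");
            // Each new subtask must read the overlapping part of every old subtask's state file.
            for (int o = 0; o < oldParallelism; o++) {
                int[] overlap = intersect(target, rangeFor(maxParallelism, oldParallelism, o));
                if (overlap != null) {
                    line.append(" [").append(overlap[0]).append(", ").append(overlap[1])
                        .append("] from old subtask ").append(o).append(";");
                }
            }
            System.out.println(line);
        }
    }
}
```

Under these assumptions most new subtasks read only a slice from the middle of an old subtask's file, which is why a rescaled restore depends on seeking to a recorded offset while a restore with unchanged parallelism does not.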

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Attached. 


From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I asked in my previous message, so that I can compare?

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
I'm attaching a shorter version of the log; it looks like the full version didn't come through.

Thanks,
Alexey


Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hi Yun,
The Azure web UI shows the size of every file created by Flink as a multiple of 128 MiB (128, 256, 640); see the attached screenshot. As I understand it, this is because Flink creates them as page blobs. In the same storage, another application creates files as block blobs, and their sizes are not rounded to 128 MiB.


Thanks,
Alexey 
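
The sizes reported above are consistent with allocation in fixed 128 MiB steps. The sketch below shows only that rounding arithmetic. It assumes, without confirmation from this thread, that the size shown in the UI is the logical file length rounded up to the next multiple of 128 MiB; the class and method names are hypothetical and no Azure API is involved.

```java
/**
 * Sketch of the rounding that would produce the sizes seen in the Azure UI.
 * Assumption: allocated size = logical length rounded up to a multiple of 128 MiB.
 */
final class PageBlobSizeSketch {

    static final long STEP_BYTES = 128L * 1024 * 1024; // 128 MiB

    /** Rounds a logical length up to the next multiple of STEP_BYTES (at least one step). */
    static long allocatedSize(long logicalLength) {
        if (logicalLength <= 0) {
            return STEP_BYTES;
        }
        return ((logicalLength + STEP_BYTES - 1) / STEP_BYTES) * STEP_BYTES;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long[] logicalLengths = {4 * 1024, 200 * mib, 600 * mib};
        for (long length : logicalLengths) {
            System.out.println(length + " bytes of data -> "
                    + allocatedSize(length) / mib + " MiB allocated");
        }
    }
}
```

If this assumption holds, even a _metadata file of a few kilobytes would be displayed as 128 MiB, and a raw copy of the blob could contain padding after the real data.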


From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 8:38 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I tried to load your _metadata as a checkpoint via Checkpoints#loadCheckpointMetadata [1], but found that this file is not actually savepoint metadata. Did you upload the correct files?
Moreover, I noticed that both 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata are 128 MB in size, which is much larger than they should be. Is this expected on Azure Blob Storage, or did you upload the wrong files?


Best
Yun Tang


Screen Shot 2021-03-17 at 8.47.36 PM.png (570K)

Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" file you provided locally.

At this point I strongly suspect that this strange rescaling behavior is related to your underlying storage. Could you try using block blobs instead of page blobs [1] to see whether the behavior persists?



Best
Yun Tang
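
To make the storage hypothesis concrete: earlier in the thread the seek to a key group offset was named as the one step that could not be verified. The sketch below shows, with plain Java streams only, how a seek-then-read fails with an EOFException when the stream delivers fewer bytes than the recorded offset and length imply. It is an illustration under stated assumptions, not Flink's restore code and not an Azure client; the class name, method name, and byte counts are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

/** Sketch of a seek-then-read against a stored byte sequence; all names are hypothetical. */
final class SeekAndReadSketch {

    /**
     * Skips to the given offset and reads exactly `length` bytes.
     * Throws EOFException if the stored data ends before the request is satisfied.
     */
    static byte[] readAt(byte[] stored, long offset, int length) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored))) {
            long skipped = in.skip(offset);
            if (skipped < offset) {
                throw new EOFException(
                        "only " + skipped + " of " + offset + " bytes could be skipped");
            }
            byte[] buffer = new byte[length];
            in.readFully(buffer); // throws EOFException when fewer than `length` bytes remain
            return buffer;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] stored = new byte[100]; // stands in for one state file

        // Reading from the start, as in a restore without rescaling, succeeds.
        System.out.println("read " + readAt(stored, 0, 40).length + " bytes at offset 0");

        // Reading a slice that lies inside the data also succeeds.
        System.out.println("read " + readAt(stored, 40, 60).length + " bytes at offset 40");

        // An offset and length that reach past the available data fail.
        try {
            readAt(stored, 80, 60);
        } catch (EOFException e) {
            System.out.println("read at offset 80 failed with EOFException");
        }
    }
}
```

The sketch does not show why the storage layer would return too few bytes; it only shows that a mismatch between recorded offsets and readable data surfaces as the exception in the subject line.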


From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 12:00
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
The Azure web UI shows the size of every file created by Flink as a multiple of 128 MiB (128, 256, 640); see the attached screenshot. As I understand it, this is because Flink creates them as page blobs. In the same storage, another application creates files as block blobs, and their sizes are not rounded to 128 MiB.


Thanks,
Alexey 



Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hi Yun,
How does the underlying storage explain the fact that I can restore from the savepoint without rescaling? Does Flink write a file once or many times? If many times, there could potentially be a problem with the limit of 50,000 blocks per blob, am I right? Should I try block blobs with compaction, as described in [1], or without compaction?

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 9:31 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" file you provided locally.

At this point I strongly suspect that this strange rescaling behavior is related to your underlying storage. Could you try using block blobs instead of page blobs [1] to see whether the behavior persists?



Best
Yun Tang



Re: EOFException on attempt to scale up job with RocksDB state backend

Yun Tang
Hi Alexey,

Flink writes each checkpointed file only once. Could you try writing the checkpointed files as block blobs and see whether the problem still exists?

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 13:54
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
How does the underlying storage explain the fact that I can restore from the savepoint without rescaling? Does Flink write a file once or many times? If many times, there could potentially be a problem with the limit of 50,000 blocks per blob, am I right? Should I try block blobs with compaction, as described in [1], or without compaction?

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 9:31 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I am not familiar with azure blob storage and I cannot load the "_metadata" with your given file locally.

Currently, I highly suspect this strange rescaling behavior is related with your underlying storage, could you try to use block blob instead of page blob [1] to see whether this behavior still existed?



Best
Yun Tang


From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 12:00
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
Azure web UI shows size of all files created by Flink as 128Mib * X (128, 256, 640), see screenshot attached. In my understanding this is because Flink creates them as Page Blobs. In same storage other application creates files as block blobs and they have sizes not rounded on 128Mib


Thanks,
Alexey 


From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 8:38 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I tried to load your _metadata as checkpoint via Checkpoints#loadCheckpointMetadata [1] but found this file is actually not a savepoint meta, have you ever uploaded the correct files?
Moreover, I noticed that both size of 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata are 128MB which is much larger than its correct capacity, is this expected on azure blob storage or you just uploaded the wrong files?


Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 0:45
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,

I've copied 77e77928-cb26-4543-bd41-e785fcac49f0 and _metadata to Google drive:

Compression was never enabled (docs says that RocksDB's incremental checkpoints always use snappy compression, not sure does it have effect on savepoint or not)

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 12:33 AM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your quick response. I have checked two different logs and still cannot understand why this could happen.

Take "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" for example, the key group range offset has been intersected correctly during rescale for task "Intake voice calls (6/7)". The only place I could doubt is that azure blob storage did work as expected during seek offset [1].

Have you ever enabled snappy compression [2] [3] for savepoints?
Could you also share the file "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/77e77928-cb26-4543-bd41-e785fcac49f0" so that I can seek locally and see whether it works as expected?
Moreover, could you also share the savepoint metadata "wasbs://[hidden email]/gsp/savepoints/savepoint-000000-67de6690143a/_metadata"?


[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-snapshot-compression
[3] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#compression

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 14:25
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Attached. 


From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 11:13 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Thanks for your reply. Could you also share the logs from a normal restore, as I asked in my previous message, so that I can compare?

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Wednesday, March 17, 2021 13:55
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
I'm attaching a shorter version of the log; it looks like the full version didn't come through.

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Tuesday, March 16, 2021 8:05 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I believe your exception messages were printed by Flink 1.12.2, not Flink 1.12.1, judging by the line numbers in the stack trace.

Could you share the exception message from Flink 1.12.1 when rescaling? Moreover, I hope you can share more logs from restoring and rescaling. I want to see the details of the key group handle [1].


Best 

From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 15:10
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Also, restoring from the same savepoint without a change in parallelism works fine.


From: Alexey Trenikhun <[hidden email]>
Sent: Monday, March 15, 2021 9:51 PM
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
No, I believe the original exception was from 1.12.1 to 1.12.1.

Thanks,
Alexey


From: Yun Tang <[hidden email]>
Sent: Monday, March 15, 2021 8:07:07 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi,

Can you scale the job within the same version, from 1.12.1 to 1.12.1?

Best
Yun Tang


From: Alexey Trenikhun <[hidden email]>
Sent: Tuesday, March 16, 2021 4:46
To: Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
The savepoint was taken with 1.12.1; I've tried to scale up using both the same version and 1.12.2.


From: Tzu-Li (Gordon) Tai <[hidden email]>
Sent: Monday, March 15, 2021 12:06 AM
To: [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi,

Could you provide info on the Flink version used?

Cheers,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hi Yun,
I've changed the configuration to use block blobs; however, due to another issue [1], I can't take a savepoint. I hope the job will eventually be able to process the backlog; then I will take a savepoint, re-test, and let you know.

Thanks,
Alexey




From: Yun Tang <[hidden email]>
Sent: Thursday, March 18, 2021 5:08 AM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

Flink writes each checkpointed file only once. Could you try writing the checkpointed files in block blob format and see whether the problem still exists?

Best
Yun Tang

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 13:54
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
How does the underlying storage explain the fact that I can restore from the savepoint without rescaling? Does Flink write a file once or many times? If many times, there could potentially be a problem with the limit of 50,000 blocks per blob, am I right? Should I try block blobs with compaction, as described in [1], or without compaction?

Thanks,
Alexey

From: Yun Tang <[hidden email]>
Sent: Wednesday, March 17, 2021 9:31 PM
To: Alexey Trenikhun <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Alexey,

I am not familiar with Azure Blob Storage, and I cannot load the "_metadata" locally from the file you provided.

Currently, I strongly suspect that this strange rescaling behavior is related to your underlying storage. Could you try using block blobs instead of page blobs [1] to see whether the behavior persists?



Best
Yun Tang


From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 12:00
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
The Azure web UI shows the size of every file created by Flink as a multiple of 128 MiB (128, 256, 640); see the attached screenshot. As I understand it, this is because Flink creates them as page blobs. In the same storage, another application creates files as block blobs, and their sizes are not rounded to 128 MiB.


Thanks,
Alexey 



Re: EOFException on attempt to scale up job with RocksDB state backend

Alexey Trenikhun
Hi Yun,
I was finally able to try rescaling with block blobs configured: the job rescaled from 6 to 8 without problems. So it looks like there is indeed a problem with page blobs.

Thank you for your help,
Alexey

From: Alexey Trenikhun <[hidden email]>
Sent: Thursday, March 18, 2021 11:31 PM
To: Yun Tang <[hidden email]>; Tzu-Li (Gordon) Tai <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: EOFException on attempt to scale up job with RocksDB state backend
 
Hi Yun,
I've changed the configuration to use block blobs; however, due to another issue [1], I can't take a savepoint. I hope the job will eventually be able to process the backlog; then I will take a savepoint, re-test, and let you know.

Thanks,
Alexey



