(DEPRECATED) Apache Flink User Mailing List archive.

flink restoring from state

3 messages

avilevi

flink restoring from state

Hi,

Any help figuring this out would be highly appreciated. We are running on GC. After uploading a new jar and starting it from an old savepoint (taken the day before), some of our checkpoints fail with "Checkpoint failed: The assigned slot container_e02_1550091678485_0001_01_000023_7 was removed." What is the reason for that? Some used to fail on timeout, but after I increased the timeout to 15 min, some failed with "Checkpoint failed: Checkpoint Coordinator is suspending". What can cause that, and how can we solve it?

Another question: will restoring the old state cause the consumer to consume messages starting from that savepoint?

regards

Avi

Attachment: Screen Shot 2019-02-14 at 2.18.21.png (174K)

Congxian Qiu

Re: flink restoring from state

Hi, Avi

I think "Checkpoint failed: The assigned slot container_e02_1550091678485_0001_01_000023_7 was removed" (this may be caused by a container failure or something else; you could double-check the taskmanager log for more information) and "Checkpoint failed: Checkpoint Coordinator is suspending" are not the root cause. Could you please share the jobmanager log?

Whether the consumer resumes from the position recorded in that savepoint is controlled by the consumer. Restoring only restores the offsets, and only if they were snapshotted when the savepoint was taken.
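
To illustrate the point about restored offsets, here is a minimal sketch in plain Java. It is not Flink code and does not use any Flink API. It assumes a Kafka-like source with numbered partitions; the class `OffsetRestoreSketch`, the method `resolveStartOffsets`, and the fallback to a configured start offset for partitions missing from the snapshot are all hypothetical choices made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model (not Flink code) of how a source might choose where to resume.
public class OffsetRestoreSketch {

    // For each partition, prefer the offset found in the restored snapshot.
    // Partitions with no snapshotted offset fall back to the configured start
    // offset (this fallback is an assumption of the sketch).
    static Map<Integer, Long> resolveStartOffsets(
            List<Integer> partitions,
            Map<Integer, Long> restoredOffsets,
            long configuredStartOffset) {
        Map<Integer, Long> start = new TreeMap<>();
        for (int partition : partitions) {
            Long restored = restoredOffsets.get(partition);
            start.put(partition, restored != null ? restored : configuredStartOffset);
        }
        return start;
    }

    public static void main(String[] args) {
        // Offsets that were part of the savepoint (made-up values).
        Map<Integer, Long> restored = new HashMap<>();
        restored.put(0, 1500L);
        restored.put(1, 980L);

        // Partition 2 has no snapshotted offset, so it uses the configured start (0).
        Map<Integer, Long> start =
                resolveStartOffsets(Arrays.asList(0, 1, 2), restored, 0L);
        System.out.println(start); // prints {0=1500, 1=980, 2=0}
    }
}
```

In this model, a job restored from the day-old savepoint would re-read everything after the snapshotted offsets, which is why the answer depends on whether the offsets were included in the savepoint.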

Best,

Congxian

avilevi

Re: flink restoring from state

Thank you very much,

Please find attached the jobmanager log and the taskmanager log.

Thanks

Avi

Attachment: Archive.zip (951K)