Checkpoint Error

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Checkpoint Error

Navneeth Krishnan
Hi All,

We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
	at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
	... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
	at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
	... 7 more
Caused by: java.io.IOException: Stale file handle
	at java.io.FileOutputStream.close0(Native Method)
	at java.io.FileOutputStream.access$000(FileOutputStream.java:53)
	at java.io.FileOutputStream$1.close(FileOutputStream.java:356)
	at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
	at java.io.FileOutputStream.close(FileOutputStream.java:354)
	at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)
	at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)
	... 12 more

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Checkpoint Error

Navneeth Krishnan
Hi All,

Any suggestions?

Thanks

On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
	at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
	... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
	at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
	... 7 more
Caused by: java.io.IOException: Stale file handle
	at java.io.FileOutputStream.close0(Native Method)
	at java.io.FileOutputStream.access$000(FileOutputStream.java:53)
	at java.io.FileOutputStream$1.close(FileOutputStream.java:356)
	at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
	at java.io.FileOutputStream.close(FileOutputStream.java:354)
	at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)
	at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)
	... 12 more

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Re: Checkpoint Error

Yun Gao
Hi Navneeth,

It seems from the stack that the exception is caused by the underlying EFS problems ? Have you checked
if there are errors reported for EFS, or if there might be duplicate mounting for the same EFS and others
have ever deleted the directory?

Best,
Yun


------------------Original Mail ------------------
Sender:Navneeth Krishnan <[hidden email]>
Send Date:Sun Mar 7 15:44:59 2021
Recipients:user <[hidden email]>
Subject:Re: Checkpoint Error
Hi All,

Any suggestions?

Thanks

On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)... 6 moreCaused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat java.util.concurrent.FutureTask.report(FutureTask.java:122)at java.util.concurrent.FutureTask.get(FutureTask.java:192)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)... 5 moreCaused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)... 7 moreCaused by: java.io.IOException: Stale file handleat java.io.FileOutputStream.close0(Native Method)at java.io.FileOutputStream.access$000(FileOutputStream.java:53)at java.io.FileOutputStream$1.close(FileOutputStream.java:356)at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)at java.io.FileOutputStream.close(FileOutputStream.java:354)at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)... 12 more

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Re: Checkpoint Error

Navneeth Krishnan
Hi Yun,

Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this.

Thanks

On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <[hidden email]> wrote:
Hi Navneeth,

It seems from the stack that the exception is caused by the underlying EFS problems ? Have you checked
if there are errors reported for EFS, or if there might be duplicate mounting for the same EFS and others
have ever deleted the directory?

Best,
Yun


------------------Original Mail ------------------
Sender:Navneeth Krishnan <[hidden email]>
Send Date:Sun Mar 7 15:44:59 2021
Recipients:user <[hidden email]>
Subject:Re: Checkpoint Error
Hi All,

Any suggestions?

Thanks

On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)... 6 moreCaused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat java.util.concurrent.FutureTask.report(FutureTask.java:122)at java.util.concurrent.FutureTask.get(FutureTask.java:192)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)... 5 moreCaused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)... 7 moreCaused by: java.io.IOException: Stale file handleat java.io.FileOutputStream.close0(Native Method)at java.io.FileOutputStream.access$000(FileOutputStream.java:53)at java.io.FileOutputStream$1.close(FileOutputStream.java:356)at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)at java.io.FileOutputStream.close(FileOutputStream.java:354)at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)... 12 more

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Re: Re: Checkpoint Error

Yun Gao
Hi Navneeth,

Is the attached exception the root cause for the checkpoint failure ?
Namely is it also reported in job manager log?

Also, have you enabled concurrent checkpoint? 

Best,
 Yun


------------------Original Mail ------------------
Sender:Navneeth Krishnan <[hidden email]>
Send Date:Mon Mar 8 13:10:46 2021
Recipients:Yun Gao <[hidden email]>
CC:user <[hidden email]>
Subject:Re: Re: Checkpoint Error
Hi Yun,

Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this.

Thanks

On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <[hidden email]> wrote:
Hi Navneeth,

It seems from the stack that the exception is caused by the underlying EFS problems ? Have you checked
if there are errors reported for EFS, or if there might be duplicate mounting for the same EFS and others
have ever deleted the directory?

Best,
Yun


------------------Original Mail ------------------
Sender:Navneeth Krishnan <[hidden email]>
Send Date:Sun Mar 7 15:44:59 2021
Recipients:user <[hidden email]>
Subject:Re: Checkpoint Error
Hi All,

Any suggestions?

Thanks

On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)... 6 moreCaused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat java.util.concurrent.FutureTask.report(FutureTask.java:122)at java.util.concurrent.FutureTask.get(FutureTask.java:192)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)... 5 moreCaused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)... 7 moreCaused by: java.io.IOException: Stale file handleat java.io.FileOutputStream.close0(Native Method)at java.io.FileOutputStream.access$000(FileOutputStream.java:53)at java.io.FileOutputStream$1.close(FileOutputStream.java:356)at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)at java.io.FileOutputStream.close(FileOutputStream.java:354)at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)... 12 more

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Re: Re: Checkpoint Error

Till Rohrmann
Could it be that another process might have deleted the in progress checkpoint file?

Cheers,
Till

On Mon, Mar 8, 2021 at 4:31 PM Yun Gao <[hidden email]> wrote:
Hi Navneeth,

Is the attached exception the root cause for the checkpoint failure ?
Namely is it also reported in job manager log?

Also, have you enabled concurrent checkpoint? 

Best,
 Yun


------------------Original Mail ------------------
Sender:Navneeth Krishnan <[hidden email]>
Send Date:Mon Mar 8 13:10:46 2021
Recipients:Yun Gao <[hidden email]>
CC:user <[hidden email]>
Subject:Re: Re: Checkpoint Error
Hi Yun,

Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this.

Thanks

On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <[hidden email]> wrote:
Hi Navneeth,

It seems from the stack that the exception is caused by the underlying EFS problems ? Have you checked
if there are errors reported for EFS, or if there might be duplicate mounting for the same EFS and others
have ever deleted the directory?

Best,
Yun


------------------Original Mail ------------------
Sender:Navneeth Krishnan <[hidden email]>
Send Date:Sun Mar 7 15:44:59 2021
Recipients:user <[hidden email]>
Subject:Re: Checkpoint Error
Hi All,

Any suggestions?

Thanks

On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS.

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)... 6 moreCaused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat java.util.concurrent.FutureTask.report(FutureTask.java:122)at java.util.concurrent.FutureTask.get(FutureTask.java:192)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)... 5 moreCaused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)... 7 moreCaused by: java.io.IOException: Stale file handleat java.io.FileOutputStream.close0(Native Method)at java.io.FileOutputStream.access$000(FileOutputStream.java:53)at java.io.FileOutputStream$1.close(FileOutputStream.java:356)at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)at java.io.FileOutputStream.close(FileOutputStream.java:354)at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)... 12 more

Thanks