1.5 some thing weird

Vishal Santoshi
drwxr-xr-x   - root hadoop          0 2018-07-09 12:33 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-123
drwxr-xr-x   - root hadoop          0 2018-07-09 12:35 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-124
drwxr-xr-x   - root hadoop          0 2018-07-09 12:51 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-126
drwxr-xr-x   - root hadoop          0 2018-07-09 12:53 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-127
drwxr-xr-x   - root hadoop          0 2018-07-09 12:55 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-128

See the missing chk-125

I see the above checkpoints for a job. At 2018-07-09 12:38:43, this exception was thrown.


chk-125 is missing from HDFS, yet the job complains about it:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.

At about the same time:

ID: 125
Failure Time: 12:38:23
Cause: Checkpoint expired before completing.


Is this some race condition? A checkpoint had to be taken, and that was chk-125; it took longer than the configured time (1 minute), so it aborted the pipeline. Should it have? It never actually created chk-125, but then refers to it and aborts the pipeline.
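
For context, the one-minute limit mentioned above is the checkpoint timeout. A minimal sketch of how such a timeout is typically configured (the class name, interval, and timeout values are illustrative, not taken from this job's actual code):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every minute (illustrative interval).
        env.enableCheckpointing(60_000L);

        // Expire any checkpoint that does not complete within one minute; an expired
        // checkpoint is what shows up as "Checkpoint expired before completing".
        env.getCheckpointConfig().setCheckpointTimeout(60_000L);

        // ... sources, operators, and sinks would be defined here, then env.execute(...)
    }
}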

This is the full exception.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).}
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
	at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
	... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
	at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:705)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:641)
	at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.


Re: 1.5 some thing weird

Till Rohrmann
Hi Vishal,

it looks as if the flushing of the checkpoint data to HDFS failed due to some expired lease on the checkpoint file. Therefore, Flink aborted the checkpoint `chk-125` and removed it. This is the normal behaviour if Flink cannot complete a checkpoint. As you can see, afterwards, the checkpoints are again successful.

Cheers,
Till

Re: 1.5 some thing weird

Vishal Santoshi
That makes sense; what does not make sense is that the pipeline restarted. I would have imagined that an aborted checkpoint would not abort the pipeline.

Re: 1.5 some thing weird

Till Rohrmann
Whether a Flink task should fail in case of a checkpoint error or not can be configured via the CheckpointConfig which you can access via the StreamExecutionEnvironment. You have to call `CheckpointConfig#setFailOnCheckpointingErrors(false)` to deactivate the default behaviour where the task always fails in case of a checkpoint error.
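
A minimal sketch of that configuration, assuming a standard DataStream job setup (the class name and checkpoint interval are placeholders):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerateCheckpointErrorsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // placeholder checkpoint interval

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // With false, a failed or expired checkpoint is simply discarded and the job
        // keeps running; with the default (true), the task fails and the job restarts.
        checkpointConfig.setFailOnCheckpointingErrors(false);

        // ... build the pipeline here, then env.execute("job name");
    }
}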

Cheers,
Till

Re: 1.5 some thing weird

Vishal Santoshi
Will try the setting out. Do not want to push it, but the exception can be much more descriptive :)

Thanks much
