1.5 some thing weird

Vishal Santoshi
drwxr-xr-x   - root hadoop          0 2018-07-09 12:33 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-123
drwxr-xr-x   - root hadoop          0 2018-07-09 12:35 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-124
drwxr-xr-x   - root hadoop          0 2018-07-09 12:51 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-126
drwxr-xr-x   - root hadoop          0 2018-07-09 12:53 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-127
drwxr-xr-x   - root hadoop          0 2018-07-09 12:55 /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-128

See the missing chk-125

I see the above checkpoints for a job. At 2018-07-09 12:38:43, this exception was thrown.


chk-125 is missing from HDFS, yet the job complains about it:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.

At about the same time:

ID: 125
Failure Time: 12:38:23
Cause: Checkpoint expired before completing.


Is this some race condition? A checkpoint had to be taken, and that was chk-125; it took longer than the configured time (1 minute), so it aborted the pipeline. Should it have? It never actually created chk-125, but then refers to it and aborts the pipeline.
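
For context, the one-minute limit mentioned above is the checkpoint timeout. A minimal sketch of how such a timeout is typically configured (the class name, interval, and timeout values are illustrative, not taken from this job's actual code):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every minute (illustrative interval).
        env.enableCheckpointing(60_000L);

        // Expire any checkpoint that does not complete within one minute; an expired
        // checkpoint is what shows up as "Checkpoint expired before completing".
        env.getCheckpointConfig().setCheckpointTimeout(60_000L);

        // ... sources, operators, and sinks would be defined here, then env.execute(...)
    }
}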

This is the full exception.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).}
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 125 for operator 360 minute interval -> 360 minutes to TimeSeries.Entry.2 (5/6).
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
	at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
	at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
	... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://nn-crunchy.bf2.tumblr.net:8020/flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e in order to obtain the stream state handle
	at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
	at org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:705)
	at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$HeapSnapshotStrategy$1.performOperation(HeapKeyedStateBackend.java:641)
	at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /flink/kpi_unique/392d0436e53f3ef5e494ba3cc63428bf/chk-125/e9d6886c-e693-4827-97bc-dd3fd526b64e (inode 1987098987): File does not exist. Holder DFSClient_NONMAPREDUCE_1527557459_11240 does not have any open files.


Re: 1.5 some thing weird

Till Rohrmann
Hi Vishal,

it looks as if the flushing of the checkpoint data to HDFS failed due to some expired lease on the checkpoint file. Therefore, Flink aborted the checkpoint `chk-125` and removed it. This is the normal behaviour if Flink cannot complete a checkpoint. As you can see, afterwards, the checkpoints are again successful.

Cheers,
Till

Re: 1.5 some thing weird

Vishal Santoshi
That makes sense; what does not make sense is that the pipeline restarted. I would have imagined that an aborted checkpoint would not abort the pipeline.

Re: 1.5 some thing weird

Till Rohrmann
Whether a Flink task should fail in case of a checkpoint error or not can be configured via the CheckpointConfig which you can access via the StreamExecutionEnvironment. You have to call `CheckpointConfig#setFailOnCheckpointingErrors(false)` to deactivate the default behaviour where the task always fails in case of a checkpoint error.
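
A minimal sketch of that configuration, assuming a standard DataStream job setup (the class name and checkpoint interval are placeholders):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerateCheckpointErrorsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // placeholder checkpoint interval

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // With false, a failed or expired checkpoint is simply discarded and the job
        // keeps running; with the default (true), the task fails and the job restarts.
        checkpointConfig.setFailOnCheckpointingErrors(false);

        // ... build the pipeline here, then env.execute("job name");
    }
}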

Cheers,
Till

Re: 1.5 some thing weird

Vishal Santoshi
Will try the setting out. Do not want to push it, but the exception can be much more descriptive :)

Thanks much
