Job Restart Failure

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Job Restart Failure

Navneeth Krishnan
Hi All,

I'm facing an issue in our flink application. This happens in version 1.4.0 and 1.7.2. We have both versions and we are seeing this problem on both. We are running flink on ECS and checkpointing enabled to EFS. When the pipeline restarts due to some node failure or any other reason, it just keeps restarting until the retry attempts without this same error message. When I checked the EFS volume I do see the file is still available but for some reason flink is unable to recover the job. Any pointers will help. Thanks

java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore operator state backend for StreamSource_cbc357ccb763df2852fee8c4fc7d55f2_(14/18) from any of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:245)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:143)
	... 5 more
Caused by: java.io.FileNotFoundException: /mnt/checkpoints/150dee2a70cecdd41b63a06b42a95649/chk-52/76363f89-d19f-44aa-aaf9-b33d89ec7c6c (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
	at org.apache.flink.runtime.state.OperatorStreamStateHandle.openInputStream(OperatorStreamStateHandle.java:66)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:286)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:62)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
	... 7 more

EFS:

image.png


Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Job Restart Failure

Navneeth Krishnan
Hi All,

Any feedback on how this can be resolved? This is causing downtime in production.

Thanks



On Tue, Oct 20, 2020 at 4:39 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

I'm facing an issue in our flink application. This happens in version 1.4.0 and 1.7.2. We have both versions and we are seeing this problem on both. We are running flink on ECS and checkpointing enabled to EFS. When the pipeline restarts due to some node failure or any other reason, it just keeps restarting until the retry attempts without this same error message. When I checked the EFS volume I do see the file is still available but for some reason flink is unable to recover the job. Any pointers will help. Thanks

java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore operator state backend for StreamSource_cbc357ccb763df2852fee8c4fc7d55f2_(14/18) from any of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:245)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:143)
	... 5 more
Caused by: java.io.FileNotFoundException: /mnt/checkpoints/150dee2a70cecdd41b63a06b42a95649/chk-52/76363f89-d19f-44aa-aaf9-b33d89ec7c6c (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
	at org.apache.flink.runtime.state.OperatorStreamStateHandle.openInputStream(OperatorStreamStateHandle.java:66)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:286)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:62)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
	... 7 more

EFS:

image.png


Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Job Restart Failure

Till Rohrmann
Hi Navneeth,

sorry for the late reply. To me it looks as if /mnt/checkpoints/150dee2a70cecdd41b63a06b42a95649/chk-52/76363f89-d19f-44aa-aaf9-b33d89ec7c6c has not been mounted to the EC2 machine you are using to run the job. Could you try to log in onto the machine when the problem occurs and check whether you can open the checkpointing path? Maybe the EFS troubleshooting guide might also be of help [1].


Cheers,
Till

On Wed, Oct 21, 2020 at 7:46 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Any feedback on how this can be resolved? This is causing downtime in production.

Thanks



On Tue, Oct 20, 2020 at 4:39 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

I'm facing an issue in our flink application. This happens in version 1.4.0 and 1.7.2. We have both versions and we are seeing this problem on both. We are running flink on ECS and checkpointing enabled to EFS. When the pipeline restarts due to some node failure or any other reason, it just keeps restarting until the retry attempts without this same error message. When I checked the EFS volume I do see the file is still available but for some reason flink is unable to recover the job. Any pointers will help. Thanks

java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore operator state backend for StreamSource_cbc357ccb763df2852fee8c4fc7d55f2_(14/18) from any of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:245)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:143)
	... 5 more
Caused by: java.io.FileNotFoundException: /mnt/checkpoints/150dee2a70cecdd41b63a06b42a95649/chk-52/76363f89-d19f-44aa-aaf9-b33d89ec7c6c (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
	at org.apache.flink.runtime.state.OperatorStreamStateHandle.openInputStream(OperatorStreamStateHandle.java:66)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:286)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:62)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
	... 7 more

EFS:

image.png


Thanks