Checkpoints issue and job failing

Navneeth Krishnan
Hi All,

We are running into checkpoint timeouts more and more frequently in production, and we also see the exception below. We are running Flink 1.4.0, and the checkpoints are saved on NFS. Can someone suggest how to overcome this?

java.lang.IllegalStateException: Could not initialize operator state backend.
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)

Thanks

Re: Checkpoints issue and job failing

vino yang
Hi Navneeth,

Did you check whether the path in the exception really cannot be found?

Best,
Vino
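
One way to run that check is sketched below. The stack trace shows the file being opened through `LocalDataInputStream`, so the path is resolved by the local file system of whichever host runs the task; the check is therefore most useful when run on each TaskManager host rather than on a single machine. This is only a rough diagnostic using the JDK, not part of Flink: the class name `CheckpointPathCheck`, its helper methods, and the regular expression are made up for illustration, and the pattern assumes the path contains no spaces.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Flink class): extracts the path from a
// FileNotFoundException line and reports how this host sees that path.
class CheckpointPathCheck {

    // Matches "FileNotFoundException: <path> (No such file or directory)".
    // Assumption: the path itself contains no whitespace.
    private static final Pattern MISSING_FILE =
            Pattern.compile("FileNotFoundException: (\\S+) \\(No such file or directory\\)");

    /** Returns the path named in the exception line, or null if the line does not match. */
    static String extractPath(String exceptionLine) {
        Matcher m = MISSING_FILE.matcher(exceptionLine);
        return m.find() ? m.group(1) : null;
    }

    /** Describes what the local file system view of this host says about the path. */
    static String describe(Path path) {
        if (!Files.exists(path)) {
            return "MISSING";
        }
        if (!Files.isRegularFile(path)) {
            return "NOT_A_FILE";
        }
        return Files.isReadable(path) ? "READABLE" : "NOT_READABLE";
    }

    public static void main(String[] args) {
        // Pass the "Caused by:" line as the first argument, or fall back to the one from this thread.
        String line = args.length > 0 ? args[0]
                : "Caused by: java.io.FileNotFoundException: "
                + "/mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/"
                + "e71d8eaf-ff4a-4783-92bd-77e3d8978e01 (No such file or directory)";
        String path = extractPath(line);
        if (path == null) {
            System.out.println("No file path found in: " + line);
            return;
        }
        System.out.println(path + " -> " + describe(Paths.get(path)));
    }
}
```

If one host reports `MISSING` while another reports `READABLE`, that would point at the NFS mount on that host (or at a timing issue with when the file becomes visible) rather than at the file itself.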

In reply to Navneeth Krishnan <[hidden email]>, Fri, Jan 3, 2020 at 8:23 AM.

Re: Checkpoints issue and job failing

Congxian Qiu
Hi

Have you checked whether this problem also exists on Flink 1.9?

Best,
Congxian


In reply to vino yang <[hidden email]>, Fri, Jan 3, 2020 at 3:54 PM.

Re: Checkpoints issue and job failing

Navneeth Krishnan
Thanks Congxian & Vino.

Yes, the file does exist, and I don't see any problem accessing it.

Regarding Flink 1.9, we haven't migrated yet, but we are planning to. Since we have to test it, it might take some time.

Thanks

In reply to Congxian Qiu <[hidden email]>, Fri, Jan 3, 2020 at 2:14 AM.

Re: Checkpoints issue and job failing

vino yang
Hi Navneeth,

Since the file still exists, this exception is very strange.

I want to ask: does it happen only occasionally, or frequently?

Another concern is that version 1.4 is very old, so maintenance and responses for it are not as timely as for recent versions. I personally recommend upgrading as soon as possible.

I can ping [hidden email] to see whether the cause of this problem can be explained.

Best,
Vino

In reply to Navneeth Krishnan <[hidden email]>, Sat, Jan 4, 2020 at 1:03 AM.

Re: Checkpoints issue and job failing

Piotr Nowojski-3
Hi,

Off the top of my head, I don’t remember anything in particular. However, release 1.4.0 came with quite a lot of deep changes, which had their fair share of bugs that were subsequently fixed in later releases.

Because the 1.4.x tree is no longer supported, I would strongly recommend first upgrading to a more recent Flink version. If that’s not possible, I would at least upgrade to the latest release from the 1.4.x tree (1.4.2).

Piotrek

In reply to vino yang <[hidden email]>, 6 Jan 2020 at 07:25.


Re: Checkpoints issue and job failing

Navneeth Krishnan
Thanks Vino & Piotr,

Sure, we will upgrade the Flink version and monitor it to see if the problem still exists.

Thanks

In reply to Piotr Nowojski <[hidden email]>, Mon, Jan 6, 2020 at 12:39 AM.