Job manager failing because Flink does not find checkpoints on HDFS

Farouk
Hi

We are running Flink on Kubernetes with HDFS. The JobManager (JM) crashed for some reason.

Has anybody already encountered an error like the one in the attached log file?

Caused by: java.lang.Exception: Cannot set up the user code libraries: File does not exist: /projects/dev/flink-recovery/default/blob/job_c9642e4287d7075b53922fba162665d0/blob_p-0debc7cbf567a71ea6c8fc3efb5855aa6617fdea-0e5d3178b26aef6112fed55559d41634
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)

Thanks
Farouk

Attachment: error.log (16K)

Re: Job manager failing because Flink does not find checkpoints on HDFS

Zhu Zhu
Hi Farouk,

This issue is not related to checkpoints. The JM fails to launch because the job's user jar blob is missing on HDFS.
Does this issue always happen? If it occurs only rarely, the file might have been unexpectedly deleted by someone else.
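
To check whether the blob is really gone, one option is to pull the missing path out of the log and list its parent directory on HDFS. The following is only a minimal sketch: the helper names `missing_blob_path` and `hdfs_ls_command` are made up for illustration, and it assumes the log contains the "File does not exist: <path>" text shown in the stack trace above. It only builds the `hdfs dfs -ls` command and does not run it.

```python
import posixpath
import re

# Matches the "File does not exist: <path>" part of the HDFS error message.
MISSING_FILE = re.compile(r"File does not exist: (\S+)")


def missing_blob_path(log_text):
    """Return the first HDFS path reported as missing, or None if there is none."""
    match = MISSING_FILE.search(log_text)
    return match.group(1) if match else None


def hdfs_ls_command(path):
    """Build an `hdfs dfs -ls` command for the directory that should hold the blob."""
    return ["hdfs", "dfs", "-ls", posixpath.dirname(path)]


# The "Caused by" line from the original post, split only for readability.
log_line = (
    "Caused by: java.lang.Exception: Cannot set up the user code libraries: "
    "File does not exist: /projects/dev/flink-recovery/default/blob/"
    "job_c9642e4287d7075b53922fba162665d0/"
    "blob_p-0debc7cbf567a71ea6c8fc3efb5855aa6617fdea-0e5d3178b26aef6112fed55559d41634"
)

path = missing_blob_path(log_line)
command = hdfs_ls_command(path)
print(" ".join(command))
```

Running the printed command on a machine with HDFS client access shows whether the job directory exists at all or whether only the blob file is missing.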

Thanks,
Zhu Zhu

On Thu, Aug 1, 2019 at 5:22 PM, Farouk <[hidden email]> wrote:
Hi

We are running Flink on Kubernetes with HDFS. The JobManager (JM) crashed for some reason.

Has anybody already encountered an error like the one in the attached log file?

Caused by: java.lang.Exception: Cannot set up the user code libraries: File does not exist: /projects/dev/flink-recovery/default/blob/job_c9642e4287d7075b53922fba162665d0/blob_p-0debc7cbf567a71ea6c8fc3efb5855aa6617fdea-0e5d3178b26aef6112fed55559d41634
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)

Thanks
Farouk

Re: Job manager failing because Flink does not find checkpoints on HDFS

Farouk
Hi all

I am sorry. We found out that it's a problem in our deployment.

The directories in ZooKeeper and HDFS are not the same.
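
For anyone hitting the same thing, a mismatch like this can be spotted by comparing the high-availability entries of two Flink configuration files. This is a minimal sketch under stated assumptions: the configs are flat `key: value` text, the keys compared are Flink's `high-availability.storageDir`, `high-availability.zookeeper.path.root` and `high-availability.cluster-id`, and all values below are hypothetical, not the ones from our deployment. The function names are also made up for illustration.

```python
# Flink high-availability keys that should agree across a deployment.
HA_KEYS = (
    "high-availability.storageDir",
    "high-availability.zookeeper.path.root",
    "high-availability.cluster-id",
)


def parse_flat_config(text):
    """Parse flat `key: value` lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values like hdfs:///... stay intact.
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def ha_differences(config_a, config_b):
    """Return {key: (value_a, value_b)} for every HA key whose values differ."""
    a, b = parse_flat_config(config_a), parse_flat_config(config_b)
    return {k: (a.get(k), b.get(k)) for k in HA_KEYS if a.get(k) != b.get(k)}


# Hypothetical configs: what the deployment should use vs. what one pod loaded.
expected = """
high-availability.storageDir: hdfs:///projects/dev/flink-recovery
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /default
"""
deployed = """
# hypothetical stale value
high-availability.storageDir: hdfs:///projects/dev/flink-recovery-old
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /default
"""

differences = ha_differences(expected, deployed)
for key, (want, got) in differences.items():
    print(f"{key}: expected {want}, deployed {got}")
```

An empty result means the three settings agree. It does not prove that the paths actually exist in ZooKeeper or on HDFS.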

Thanks for the help

Farouk

On Thu, Aug 1, 2019 at 11:38 AM, Zhu Zhu <[hidden email]> wrote:
Hi Farouk,

This issue is not related to checkpoints. The JM fails to launch because the job's user jar blob is missing on HDFS.
Does this issue always happen? If it occurs only rarely, the file might have been unexpectedly deleted by someone else.

Thanks,
Zhu Zhu
