FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Adrian Vasiliu
Hello, 
 
We recently upgraded our product from Flink 1.7.2 to Flink 1.9, and our jobs now fail repeatedly with:
 
java.lang.RuntimeException: Could not create file for checking if truncate works. You can disable support for truncate() completely via BucketingSink.setUseTruncate(false).
    at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.reflectTruncate(BucketingSink.java:645)
    at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:388)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:878)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:392)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3368)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3292)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:850)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:504)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
 
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
    at org.apache.hadoop.ipc.Client.call(Client.java:1435)
    at org.apache.hadoop.ipc.Client.call(Client.java:1345)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy49.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
    at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    at com.sun.proxy.$Proxy50.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
 
Reading through https://issues.apache.org/jira/browse/FLINK-13593, it looks related, but that issue is marked as fixed in 1.9.

The discussion there then points to https://issues.apache.org/jira/browse/FLINK-13497, which is marked as unresolved, with a fix targeted at 1.10.

Could you shed some light on the following:
1/ Can you confirm that our stack trace is related to https://issues.apache.org/jira/browse/FLINK-13497?
2/ Is there an ETA for a 1.9.x release that fixes it?
 
Thanks
Adrian Vasiliu


Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Congxian Qiu
Hi

Judging from the stack trace, you may want to solve the "replication problem" first: "File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation." The answer from SO[1] may help.
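
To make the point about the stack trace concrete: the Flink RuntimeException at the top only wraps the HDFS error listed under "Caused by", so the innermost cause is the one to investigate. Below is a minimal sketch of walking such a cause chain. The class name RootCauseDemo and the helper rootCause are hypothetical, and a plain IOException stands in for org.apache.hadoop.ipc.RemoteException so that the example needs no Hadoop dependency.

```java
import java.io.IOException;

public class RootCauseDemo {

    // Hypothetical helper: follow getCause() links down to the innermost exception.
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the HDFS-side failure (a RemoteException in the real trace).
        IOException hdfsFailure = new IOException(
            "File /okd-dev/3fe6b069-43bf-4d86-9762-4f501c9db16e could only be "
            + "replicated to 0 nodes instead of minReplication (=1).");

        // Stand-in for the wrapper thrown by BucketingSink.reflectTruncate.
        RuntimeException flinkFailure = new RuntimeException(
            "Could not create file for checking if truncate works.", hdfsFailure);

        // Prints the class name and message of the innermost cause.
        Throwable root = rootCause(flinkFailure);
        System.out.println(root.getClass().getSimpleName() + ": " + root.getMessage());
    }
}
```

In the trace from this thread, the innermost cause is the NameNode failing to place a new block, which is why the HDFS side is worth checking before the Flink side.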


Adrian Vasiliu <[hidden email]> wrote on Mon, Oct 14, 2019 at 9:10 PM:

RE: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Adrian Vasiliu
Thanks Congxian. The possible causes listed in the most-voted answer at https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025 do not seem to apply to us: we have other, very similar Flink jobs using the same Hadoop server and root directory (under different HDFS paths), and they work. So, in principle, the Hadoop server-side configuration should not be the cause. Also, according to the Ambari monitoring tools, the Hadoop server is healthy, and we did restart it. Still, we'll check all the points mentioned in the various answers, in particular the one about temp files.
Thanks
Adrian
 
----- Original message -----
From: Congxian Qiu <[hidden email]>
To: Adrian Vasiliu <[hidden email]>
Cc: user <[hidden email]>
Subject: [EXTERNAL] Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS
Date: Tue, Oct 15, 2019 4:02 AM
 

RE: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Adrian Vasiliu
Hi,
FYI, we've switched to a different Hadoop server, and the issue vanished. It does look as though the cause was on the Hadoop side.
Thanks again Congxian.
Adrian 
 
----- Original message -----
From: "Adrian Vasiliu" <[hidden email]>
To: [hidden email]
Cc: [hidden email]
Subject: [EXTERNAL] RE: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS
Date: Tue, Oct 15, 2019 8:37 AM
 

Re: FLINK-13497 / "Could not create file for checking if truncate works" / HDFS

Congxian Qiu
Glad to hear it! 

Best,
Congxian


Adrian Vasiliu <[hidden email]> wrote on Tue, Oct 15, 2019 at 9:10 PM: