Re: ScannerTimeout over long running process

Posted by Flavio Pompermaier on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/ScannerTimeout-over-long-running-process-tp483p485.html

Could it be that the TaskManager sometimes pauses for a long time between one inputFormat.nextRecord() call and the next?

On Thu, Nov 27, 2014 at 3:44 PM, Stefano Bortoli <[hidden email]> wrote:
hi all,

I am facing an odd issue while running a fairly complex duplicate detection process.

The code runs like a charm on a dataset of a million entries with few duplicates (3 minutes), but hits the scanner timeout on a dataset of 9.2M.

The problem happens randomly, and I don't think it is related to the business logic, or to the scan configuration for that matter.

The caching block is set to 100, and the scan timeout is 900,000 milliseconds (15 min). The job would normally run in around 0.5 seconds on 100 entries, so I must be hitting something deep, something related to how Hadoop and HBase work together.
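To put those numbers side by side, here is a rough sketch of the arithmetic. It assumes (this is my reading of the HBase client, not something confirmed in this thread) that the scanner timeout is measured between two consecutive fetches from the region server, and that one fetch returns a full caching batch. ScanBudget and its methods are hypothetical names, not part of HBase or Flink.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of HBase or Flink: plain arithmetic on the
// numbers quoted in this thread.
final class ScanBudget {

    // Average time the consumer may spend per entry before the timeout hits,
    // ASSUMING the timeout clock restarts once per batch of `caching` entries.
    static long perRowBudgetMillis(long scannerTimeoutMillis, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return scannerTimeoutMillis / caching;
    }

    // Converts milliseconds to whole minutes (truncating).
    static long toWholeMinutes(long millis) {
        return TimeUnit.MILLISECONDS.toMinutes(millis);
    }

    public static void main(String[] args) {
        long timeoutMs = 900_000L;        // scan timeout from the post
        int caching = 100;                // caching value from the post
        long observedGapMs = 2_387_347L;  // from the ScannerTimeoutException message

        System.out.println("per-entry budget (ms): " + perRowBudgetMillis(timeoutMs, caching));
        System.out.println("timeout (min): " + toWholeMinutes(timeoutMs));
        System.out.println("observed gap (min): " + toWholeMinutes(observedGapMs));
    }
}
```

Under that assumption, the budget is about 9 seconds per entry on average, while the gap reported in the exception message is more than twice the whole 15 minute timeout. That looks less like slow per-entry work and more like one long stall somewhere between two fetches.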

My problem is that it may fail or it may not. Yesterday I could complete the whole scan without problems, then the job failed on another error. Today, the same code failed after 3.5h, a little before completion of the first phase.

I think it may be something about GC.

I log the execution time of every single map call, and each one finishes within milliseconds. The exception still happens (I catch it, print it, and throw it again).
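Timing each map call only shows how long the user function runs. It does not show how long the source waits between two reads. A minimal sketch of a tracker for that gap follows. CallGapTracker is a hypothetical class, not part of Flink or HBase, and where exactly to call mark() (for example at the top of a wrapper around the input format's nextRecord()) is left as an assumption.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Flink or HBase: records the time between
// successive calls to mark(), so long pauses between reads become visible.
final class CallGapTracker {
    private final LongSupplier clockMillis;
    private boolean started = false;
    private long lastCall = 0L;
    private long maxGap = 0L;

    // The clock is injectable so the tracker can be exercised without sleeping.
    CallGapTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    // Default: wall-clock time in milliseconds.
    CallGapTracker() {
        this(System::currentTimeMillis);
    }

    // Call once per read. Returns the gap in milliseconds since the previous
    // call, or 0 on the first call.
    long mark() {
        long now = clockMillis.getAsLong();
        long gap = started ? now - lastCall : 0L;
        started = true;
        lastCall = now;
        if (gap > maxGap) {
            maxGap = gap;
        }
        return gap;
    }

    // Largest gap seen so far, in milliseconds.
    long maxGapMillis() {
        return maxGap;
    }
}
```

Logging the value returned by mark() whenever it exceeds some threshold (say, a minute) would show whether the pause sits between reads rather than inside the map function.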

Any idea of where the issue could be?

thanks a lot for the support. Stack trace appended.

saluti,
Stefano

Error: org.apache.hadoop.hbase.client.ScannerTimeoutException: 2387347ms passed since the last invocation, timeout is currently set to 900000
at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:352)
at org.apache.flink.addons.hbase.TableInputFormat.nextRecord(TableInputFormat.java:106)
at org.apache.flink.addons.hbase.TableInputFormat.nextRecord(TableInputFormat.java:48)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:195)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:246)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 291, already closed?
at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3043)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
at java.lang.Thread.run(Thread.java:745)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:283)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:198)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:57)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:336)
... 5 more
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.UnknownScannerException): org.apache.hadoop.hbase.UnknownScannerException: Name: 291, already closed?
at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3043)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
at java.lang.Thread.run(Thread.java:745)

at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1458)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1662)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1720)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:29900)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:168)