No activity but checkpoints are failing and backpressure is high


Dmitry Minaev
Hi everyone,

We have a small QA environment with just one job manager and one task manager. Several jobs are running with parallelism 1.
There is a problem with one of them: during our regular upgrade process, the job wasn't cancelled because the savepoint timed out:

Cancelling job 1b80efe346d437c01e17b6efda640909 with savepoint to /path/to/nfsrecovery/flink-distribution.
 
------------------------------------------------------------
The program finished with the following exception:
 
java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
       at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
       at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
       at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
       at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
       at scala.concurrent.Await$.result(package.scala:190)
       at scala.concurrent.Await.result(package.scala)
       at org.apache.flink.client.program.ClusterClient.cancelWithSavepoint(ClusterClient.java:621)
       at org.apache.flink.client.CliFrontend.cancel(CliFrontend.java:628)
       at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1060)
       at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
       at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
       at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
       at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
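For what it's worth, I believe the 60 s in the exception is the client-side timeout rather than anything on the cluster, so one thing I'm considering is raising it in flink-conf.yaml before retrying the cancel-with-savepoint (the 600 s value below is just an arbitrary example, not something we've validated):

```yaml
# flink-conf.yaml -- raise the client-side Akka timeout (default 60 s)
# before retrying "flink cancel -s <savepoint-dir> <job-id>"
akka.client.timeout: 600 s
```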
 
So we ended up with two similar jobs running in parallel (not sure if that's related to the problem).

There is no activity in this environment now, but I'm seeing high backpressure on one of the operators of this job. Also, all checkpoints for this particular job are failing with a timeout (5 minutes). The other jobs are all fine.
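In case it's useful, this is roughly how I've been inspecting the failing checkpoints, via the job manager's monitoring REST API (assuming the default REST port 8081; `jq` is only there for readability):

```shell
# sketch: fetch checkpoint statistics for the stuck job from the JM REST API
curl -s http://localhost:8081/jobs/1b80efe346d437c01e17b6efda640909/checkpoints | jq '.latest'
```

The web UI's backpressure tab shows the same picture: one operator is marked HIGH even though no records should be flowing.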

I've looked at the job manager logs and noticed that once a day there is a connection issue between the JM and TM nodes:

01 Aug 2018 22:07:18,613 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:41651
01 Aug 2018 22:07:18,613 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@...:41651] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@...:41651]] Caused by: [Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:41651]
02 Aug 2018 22:07:18,700 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:36539
02 Aug 2018 22:07:18,700 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@...:36539] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@...:36539]] Caused by: [Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:36539]
02 Aug 2018 22:07:23,502 WARN akka.remote.Remoting - Association to [akka.tcp://flink@...:42579] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms.

Other than that I don't see anything strange in the logs.

Here is the task manager's memory dump, in case it helps: https://drive.google.com/file/d/1T9FqY8faWHmJOPdMC0MunxxFRQbAVDjd/view?usp=sharing

I would very much appreciate any advice to help me solve the problem.

Thank you,
Dmitry Minaev