Hello,
I have a standalone Flink 1.4.2 cluster with one JobManager (JM), one TaskManager (TM), and ZooKeeper. I first started the JM and TM and waited for them to become stable. Then I restarted the JM, and that is when the TM got confused. The TM was notified that the leader had changed and tried to register with the new leader (the new RPC port is 34561). It then got an acknowledgement saying it was already registered. After that it kept trying to associate with the old JM RPC port (35213) and failed.

2019-02-14 14:56:54,059 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.ssl.tcp://flink@...:34561/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2019-02-14 14:56:54,157 DEBUG org.apache.flink.shaded.akka.org.jboss.netty.handler.ssl.SslHandler - [id: 0x77ac93ae, /10.215.68.243:46796 => openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:34561] HANDSHAKEN: TLS_RSA_WITH_AES_128_CBC_SHA
2019-02-14 14:56:54,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.ssl.tcp://flink@...:34561/user/jobmanager), starting network stack and library cache.
2019-02-14 14:56:54,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:50100. Starting BLOB cache.
2019-02-14 14:56:54,278 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /home/y/var/flink/blobstorage/blobStore-927b523f-f3ff-4ccc-83a0-362e09a3b858
2019-02-14 14:56:54,279 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /home/y/var/flink/blobstorage/blobStore-8492465e-0e94-4792-a346-66e6da299f7a
2019-02-14 14:56:54,572 DEBUG org.apache.flink.runtime.taskmanager.TaskManager - TaskManager was triggered to register at JobManager, but is already registered
2019-02-14 14:56:56,359 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:35213
2019-02-14 14:56:56,360 DEBUG org.apache.flink.runtime.taskmanager.TaskManager - The association error event's root cause is not of type InvalidAssociationException.

Full TaskManager log: https://gist.github.com/Ethanlm/e6f1b29d27d26813f5f8f40cd2c12643

Is this expected or is this a bug?

Thank you!
Ethan
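For anyone trying to reproduce this, here is a rough sketch of how the full log could be checked for the same pattern: it counts, per port, how often the TM registers with a JobManager and how often a connection is refused. The function name `summarize_ports` and the two regular expressions are my own and only cover the message formats shown in the excerpt; other Flink versions may log differently.

```python
import re
from collections import defaultdict

# Three lines copied from the TaskManager log excerpt (host elided as in the post).
SAMPLE_LOG = """\
2019-02-14 14:56:54,059 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.ssl.tcp://flink@...:34561/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2019-02-14 14:56:54,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.ssl.tcp://flink@...:34561/user/jobmanager), starting network stack and library cache.
2019-02-14 14:56:56,359 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:35213
"""

# Hypothetical patterns, written against the excerpt only.
REGISTER = re.compile(
    r"(?:Trying to register|Successful registration) at JobManager "
    r"\(?akka\.ssl\.tcp://[^\s/]*:(\d+)/user/jobmanager"
)
REFUSED = re.compile(r"Connection refused: \S+:(\d+)")


def summarize_ports(log_text):
    """Return {port: {"register": n, "refused": m}} for the given log text."""
    counts = defaultdict(lambda: {"register": 0, "refused": 0})
    for line in log_text.splitlines():
        match = REGISTER.search(line)
        if match:
            counts[int(match.group(1))]["register"] += 1
        match = REFUSED.search(line)
        if match:
            counts[int(match.group(1))]["refused"] += 1
    return dict(counts)


if __name__ == "__main__":
    for port, stats in sorted(summarize_ports(SAMPLE_LOG).items()):
        print(port, stats)
```

If the refused count for the old port keeps growing after the registration on the new port succeeded, that matches the behaviour described above.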
|
The related job manager log is https://gist.github.com/Ethanlm/86a10e786ad9025ddaa27c113c536da8
|
Hi Ethan,

can you observe a similar behaviour with Flink 1.7.1? Flink 1.4.2 is no longer supported by the community.

Cheers,
Till

On Thu, Feb 14, 2019 at 5:06 PM Ethan Li <[hidden email]> wrote:
|
Hi Till,
I will have to test it with Flink 1.7.1 and get back to you. Thanks!

Best,
Ethan
|