(DEPRECATED) Apache Flink User Mailing List archive.

TaskManager gets confused after the JobManager restarts

Ethan Li

TaskManager gets confused after the JobManager restarts

Hello,

I have a standalone Flink 1.4.2 cluster with one JobManager (JM), one TaskManager (TM), and ZooKeeper. I started the JM and the TM and waited for them to become stable, then restarted the JM. That is when the TM got confused.

The TM was notified that the leader had changed, and it tried to register with the new leader (the new RPC port is 34561). It then got an acknowledgement saying it was already registered. After that it kept trying to associate with the old JM RPC port (35213) and failing.

2019-02-14 14:56:54,059 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.ssl.tcp://flink@openstorm10blue-n1.blue.ygrid.yahoo.com:34561/user/jobmanager (attempt 1, timeout: 500 milliseconds)

2019-02-14 14:56:54,157 DEBUG org.apache.flink.shaded.akka.org.jboss.netty.handler.ssl.SslHandler - [id: 0x77ac93ae, /10.215.68.243:46796 => openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:34561] HANDSHAKEN: TLS_RSA_WITH_AES_128_CBC_SHA

2019-02-14 14:56:54,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.ssl.tcp://flink@openstorm10blue-n1.blue.ygrid.yahoo.com:34561/user/jobmanager), starting network stack and library cache.

2019-02-14 14:56:54,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:50100. Starting BLOB cache.

2019-02-14 14:56:54,278 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /home/y/var/flink/blobstorage/blobStore-927b523f-f3ff-4ccc-83a0-362e09a3b858

2019-02-14 14:56:54,279 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /home/y/var/flink/blobstorage/blobStore-8492465e-0e94-4792-a346-66e6da299f7a

2019-02-14 14:56:54,572 DEBUG org.apache.flink.runtime.taskmanager.TaskManager - TaskManager was triggered to register at JobManager, but is already registered

2019-02-14 14:56:56,359 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:35213

2019-02-14 14:56:56,360 DEBUG org.apache.flink.runtime.taskmanager.TaskManager - The association error event's root cause is not of type InvalidAssociationException.
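
The port mismatch in the excerpt above (registration succeeds on 34561, while the refused connection targets 35213) can also be found mechanically in a longer log. The following is only a minimal sketch: the function name `stale_ports` and the two regular expressions are illustrative assumptions written against the exact message wording shown in this excerpt, not anything provided by Flink, and other Flink versions may word these messages differently.

```python
import re

# Three lines taken from the TaskManager log excerpt above.
LOG = """\
2019-02-14 14:56:54,059 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.ssl.tcp://flink@openstorm10blue-n1.blue.ygrid.yahoo.com:34561/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2019-02-14 14:56:54,276 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.ssl.tcp://flink@openstorm10blue-n1.blue.ygrid.yahoo.com:34561/user/jobmanager), starting network stack and library cache.
2019-02-14 14:56:56,359 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:35213
"""

# Hypothetical patterns, matched against the wording seen in this excerpt only.
REGISTERED = re.compile(
    r"Successful registration at JobManager \(\S+?:(\d+)/user/jobmanager\)"
)
REFUSED = re.compile(r"Connection refused: \S+:(\d+)")


def stale_ports(log_text):
    """Return ports that refused a connection but were never registered at."""
    registered, refused = set(), set()
    for line in log_text.splitlines():
        match = REGISTERED.search(line)
        if match:
            registered.add(int(match.group(1)))
        match = REFUSED.search(line)
        if match:
            refused.add(int(match.group(1)))
    return sorted(refused - registered)


print(stale_ports(LOG))  # prints [35213]
```

Any port reported here is one the TM still tries to reach even though it has registered elsewhere, which for this excerpt is the old JM RPC port.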

Full TaskManager log: https://gist.github.com/Ethanlm/e6f1b29d27d26813f5f8f40cd2c12643

Is this expected or is this a bug?

Thank you!

Ethan

Ethan Li

Re: TaskManager gets confused after the JobManager restarts

The related JobManager log is https://gist.github.com/Ethanlm/86a10e786ad9025ddaa27c113c536da8

On Feb 14, 2019, at 9:40 AM, Ethan Li <[hidden email]> wrote:


Till Rohrmann

Re: TaskManager gets confused after the JobManager restarts

Hi Ethan,

can you observe a similar behaviour with Flink 1.7.1? Flink 1.4.2 is no longer supported by the community.

Cheers,

Till

On Thu, Feb 14, 2019 at 5:06 PM Ethan Li <[hidden email]> wrote:

The related job manager log is https://gist.github.com/Ethanlm/86a10e786ad9025ddaa27c113c536da8


Ethan Li

Re: TaskManager gets confused after the JobManager restarts

Hi Till,

I will have to test it with Flink 1.7.1 and get back to you. Thanks!

Best,

Ethan

On Feb 15, 2019, at 4:01 AM, Till Rohrmann <[hidden email]> wrote:

Hi Ethan,

can you observe a similar behaviour with Flink 1.7.1? Flink 1.4.2 is no longer supported by the community.

Cheers,
Till
