TaskManager gets confused after the JobManager restarts



Ethan Li
Hello,

I have a standalone flink-1.4.2 cluster with one JobManager, one TaskManager, and ZooKeeper. I first started the JM and TM and waited for them to become stable. Then I restarted the JM. That is when the TM got confused.

The TM was notified that the leader had changed, and it tried to register with the new leader (the new RPC port is 34561). It then received an acknowledgement saying it was already registered. After that, it kept trying to associate with the old JM RPC port (35213) and failing.

2019-02-14 14:56:54,059 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.ssl.tcp://flink@openstorm10blue-n1.blue.ygrid.yahoo.com:34561/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2019-02-14 14:56:54,157 DEBUG org.apache.flink.shaded.akka.org.jboss.netty.handler.ssl.SslHandler  - [id: 0x77ac93ae, /10.215.68.243:46796 => openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:34561] HANDSHAKEN: TLS_RSA_WITH_AES_128_CBC_SHA
2019-02-14 14:56:54,276 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Successful registration at JobManager (akka.ssl.tcp://flink@openstorm10blue-n1.blue.ygrid.yahoo.com:34561/user/jobmanager), starting network stack and library cache.
2019-02-14 14:56:54,276 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Determined BLOB server address to be openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:50100. Starting BLOB cache.
2019-02-14 14:56:54,278 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Created BLOB cache storage directory /home/y/var/flink/blobstorage/blobStore-927b523f-f3ff-4ccc-83a0-362e09a3b858
2019-02-14 14:56:54,279 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /home/y/var/flink/blobstorage/blobStore-8492465e-0e94-4792-a346-66e6da299f7a
2019-02-14 14:56:54,572 DEBUG org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager was triggered to register at JobManager, but is already registered
2019-02-14 14:56:56,359 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: openstorm10blue-n1.blue.ygrid.yahoo.com/10.215.68.98:35213
2019-02-14 14:56:56,360 DEBUG org.apache.flink.runtime.taskmanager.TaskManager              - The association error event's root cause is not of type InvalidAssociationException.
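
To check a longer TaskManager log for the same symptom, a small script along these lines might help. It is only a sketch: the two regular expressions are written against the log lines quoted above, the function name `find_stale_ports` is made up for illustration, and `jm-host` is a shortened stand-in for the real hostname. It reports the port of the JobManager the TM last registered with and any different ports that were later refused.

```python
import re

# Patterns tailored to the Flink 1.4 TaskManager log lines quoted in this post.
REGISTERED = re.compile(r"Successful registration at JobManager .*?:(\d+)/user/jobmanager")
REFUSED = re.compile(r"Connection refused: \S+:(\d+)")


def find_stale_ports(log_lines):
    """Return (registered_port, stale_ports).

    stale_ports lists ports that were refused after a successful
    registration and that differ from the registered port.
    """
    registered_port = None
    stale = []
    for line in log_lines:
        match = REGISTERED.search(line)
        if match:
            registered_port = int(match.group(1))
            continue
        match = REFUSED.search(line)
        if match and registered_port is not None:
            port = int(match.group(1))
            if port != registered_port and port not in stale:
                stale.append(port)
    return registered_port, stale


# Two lines modelled on the log excerpt; "jm-host" is a hypothetical short hostname.
sample = [
    "2019-02-14 14:56:54,276 INFO  org.apache.flink.runtime.taskmanager.TaskManager - "
    "Successful registration at JobManager "
    "(akka.ssl.tcp://flink@jm-host:34561/user/jobmanager), starting network stack and library cache.",
    "2019-02-14 14:56:56,359 WARN  akka.remote.transport.netty.NettyTransport - "
    "Remote connection to [null] failed with java.net.ConnectException: "
    "Connection refused: jm-host/10.215.68.98:35213",
]
print(find_stale_ports(sample))  # (34561, [35213])
```

A non-empty second element means the TM is still trying to reach a port other than the one it registered with, which matches the behaviour described above.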





Is this expected or is this a bug? 

Thank you!

Ethan

Re: TaskManager gets confused after the JobManager restarts

Ethan Li
The related job manager log is https://gist.github.com/Ethanlm/86a10e786ad9025ddaa27c113c536da8



Re: TaskManager gets confused after the JobManager restarts

Till Rohrmann
Hi Ethan,

Can you observe similar behaviour with Flink 1.7.1? Flink 1.4.2 is no longer supported by the community.

Cheers,
Till



Re: TaskManager gets confused after the JobManager restarts

Ethan Li
Hi Till,

I will have to test it with Flink 1.7.1 and get back to you. Thanks!

Best,
Ethan

