IP resolution for metrics on k8s when the JM ( job cluster ) is rolled but TMs are not

Vishal Santoshi
Scenario

* Take a savepoint with cancel, then restore the job from it. This brings down the JM and relaunches it on a different IP, so the DNS name now resolves to a new IP.
* The TM deployment is not rolled (recreated).
* Note that `metrics.internal.query-service.port` in `flink-conf.yaml` is hardcoded.




Error seen afterwards:

Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: [dns]/172.17.6.135:6666
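To check whether the address in that message is a stale one, something like the following may help. It is only a diagnostic sketch, not anything from Flink itself: `parse_target` and `is_stale` are hypothetical helper names, the regex assumes the log line has the same shape as the one above, and the default resolver returns a single IPv4 address, which may not be enough if the name resolves to several.

```python
import re
import socket

# Matches the "<host>/<ip>:<port>" target at the end of the timeout message.
# Assumption: the line looks like the one quoted in this post.
TARGET_RE = re.compile(
    r"connection timed out: (?P<host>[^/\s]+)/"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def parse_target(log_line):
    """Return (host, ip, port) from a ConnectTimeoutException line, or None."""
    match = TARGET_RE.search(log_line)
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))


def is_stale(logged_ip, host, resolve=socket.gethostbyname):
    """True if the IP in the log differs from what `host` resolves to right now.

    `resolve` is injectable so the check can be exercised without real DNS.
    """
    return resolve(host) != logged_ip
```

Run from inside a TM pod with the real DNS name in place of `[dns]`: if `is_stale` returns True, the process is still trying the IP from before the JM was relaunched.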

Workaround: restart the TM deployment (though that should not be necessary, and it will cause latency issues on a shared resource manager such as k8s).

PS: I am sure that a cancel/restart, or a restart of the JM because of any other issue, will cause the same problem (not tested).



Regards