Hey all,
We are experimenting with large keyed state in Flink 1.2 on a single task manager, and we keep having our task manager killed by the job manager after about 10 minutes due to this exception: akka.pattern.

The task manager logs show nothing out of the ordinary, but the job manager logs show this:

2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-
2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-
2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-

The weird part is that we have not set up any metrics reporters, so I am not really sure why the job manager is asking the task manager about them. Is there a way to disable these metrics requests, or does anyone know what is causing them?

Thanks,
--
Jason Brelloch | Product Developer
Hello,
The MetricQueryService is used by the web UI to fetch metrics from the JobManager and all TaskManagers. It is only used when the web UI is accessed.

Based on the logs you gave, the TaskManager isn't killed by the JobManager; instead, the JobManager only detected that the TaskManager has shut down. It is highly unlikely that the MetricQueryService is the cause of this; the exception you are seeing is due to the TaskManager no longer being reachable. Metrics can't be fetched when the TaskManager isn't there anymore.

How do you manage the Flink cluster (YARN etc.)? Given that no exception appears in the log, I would assume that the TaskManager JVM was killed from the outside.

Regards,
Chesnay

On 20.04.2017 18:42, Jason Brelloch wrote: