Anybody else seen this? I'm running both the JM and TM on the same host in this setup. This was working fine with Flink 1.5.3.

On the TaskManager:

00:31:30.268 INFO o.a.f.r.t.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@localhost:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@localhost:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..

On the JobManager:

00:32:00.339 ERROR a.r.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@localhost:6123/]] arriving at [akka.tcp://flink@localhost:6123] inbound addresses are [akka.tcp://flink@cluster:6123]
|
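A note for anyone comparing logs: the JobManager line above names two different hosts. The message is addressed to `flink@localhost`, while the inbound address is `flink@cluster`. The sketch below pulls both hosts out of such a line and reports whether they differ. The function name `akka_host_check` is made up for illustration, and the patterns only assume the log format shown in this post.

```shell
# Hypothetical helper (not part of Flink or Akka): compare the host a message
# was addressed to with the host the receiving actor system is bound to.
akka_host_check() {
  line="$1"
  # Host inside "non-local recipient [Actor[akka.tcp://flink@HOST:PORT/]]"
  recipient_host=$(printf '%s\n' "$line" |
    sed -n 's#.*non-local recipient \[Actor\[akka\.tcp://flink@\([^:]*\):.*#\1#p')
  # Host inside "inbound addresses are [akka.tcp://flink@HOST:PORT]"
  inbound_host=$(printf '%s\n' "$line" |
    sed -n 's#.*inbound addresses are \[akka\.tcp://flink@\([^:]*\):.*#\1#p')
  if [ "$recipient_host" = "$inbound_host" ]; then
    echo "recipient=$recipient_host inbound=$inbound_host match"
  else
    echo "recipient=$recipient_host inbound=$inbound_host mismatch"
  fi
}

# The JobManager log line from the post above.
jm_line='00:32:00.339 ERROR a.r.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@localhost:6123/]] arriving at [akka.tcp://flink@localhost:6123] inbound addresses are [akka.tcp://flink@cluster:6123]'

akka_host_check "$jm_line"
```

A mismatch like this suggests checking which hostname the JobManager process was actually started with.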
We started to see the same errors after upgrading from Flink 1.4.2 to 1.6.0. We have one JM and 5 TMs on Kubernetes. The JM is running in HA mode. The TaskManagers sometimes lose their connection to the JM and log the same error you are seeing:

*2018-09-19 12:36:40,687 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..*

Once a TM starts logging "Could not resolve ResourceManager", it does not recover until I restart the TM pod.

*Here is the content of our flink-conf.yaml:*

blob.server.port: 6124
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 4096
jobmanager.web.history: 20
jobmanager.archive.fs.dir: s3://our_path
taskmanager.rpc.port: 6121
taskmanager.heap.mb: 16384
taskmanager.numberOfTaskSlots: 10
taskmanager.log.path: /opt/flink/log/output.log
web.log.path: /opt/flink/log/output.log
state.checkpoints.num-retained: 3
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
high-availability: zookeeper
high-availability.jobmanager.port: 50002
high-availability.zookeeper.quorum: zookeeper_instance_list
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: profileservice
high-availability.storageDir: s3://our_path

Any help will be greatly appreciated!
|
Anybody else seen this and know the solution? We're dead in the water with Flink 1.5.4.

On Sun, Sep 23, 2018 at 11:46 PM alex <[hidden email]> wrote:
|
Update on this: The issue was the command being used to start the jobmanager: `jobmanager.sh start-foreground cluster`. This command was left over in our automation and used to be the correct way to start the JM. In Flink 1.5.4, however, the second parameter, `cluster`, is interpreted as the hostname for the jobmanager to bind to. The solution was just to remove `cluster` from that command.

On Tue, Sep 25, 2018 at 10:15 AM Jamie Grier <[hidden email]> wrote:
|
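To make the fix above concrete, here is a minimal sketch of how the two invocations are read, going only by what this thread describes: the action comes first and an optional hostname second. `interpret_jm_args` is a hypothetical helper written for illustration. It is not part of Flink and does not start anything.

```shell
# Hypothetical helper, NOT Flink's jobmanager.sh: mimics the positional
# reading described in this thread (action first, optional hostname second).
interpret_jm_args() {
  action="${1:-}"
  host="${2:-}"
  if [ -n "$host" ]; then
    echo "action=$action host=$host"
  else
    echo "action=$action host=<unset>"
  fi
}

# Old automation command: `cluster` lands in the hostname position.
interpret_jm_args start-foreground cluster

# Fixed command: no second argument, so no hostname override.
interpret_jm_args start-foreground
```

With the old command the hostname slot holds `cluster`, which lines up with the `flink@cluster` inbound address in the JobManager log from the first post.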
Hi Jamie,

thanks for the update on how to fix the problem. This is very helpful for the rest of the community.

The removal of the execution mode parameter from the startup scripts (FLINK-8696) was actually released with Flink 1.5.0. With that change, the hostname became the second parameter, so calling the startup scripts with the old syntax causes the execution mode value to be interpreted as the hostname. This hostname option was, however, not properly evaluated until we fixed it in Flink 1.5.4, which is why the problem only surfaced now.

We definitely need to treat the startup scripts as a stable API as well. So far, we don't have good tooling to ensure that we don't introduce breaking changes. In the future we need to be more careful!

Cheers,
Till

On Tue, Sep 25, 2018 at 8:54 PM Jamie Grier <[hidden email]> wrote:
|
Hey Jamie, we've been facing the same issue with dA Platform, when running Flink 1.6.1. I assume a lot of people will be affected by this. On Tue, Sep 25, 2018 at 11:18 PM Till Rohrmann <[hidden email]> wrote:
|
Should we add a warning to the release announcements?

Fabian

On Wed, Sep 26, 2018 at 10:22 AM Robert Metzger <[hidden email]> wrote:
|
Yes, that would be a good idea. I think it should go into the release notes. Will add it. On Wed, Sep 26, 2018 at 10:24 AM Fabian Hueske <[hidden email]> wrote:
|
What do you think about reverting this change (FLINK-8696), given that the resulting failure is really hard for users to debug? One problem with reverting would be that people may now rely on the second argument being the hostname.

An alternative could be to filter out `cluster` and `local` if they appear as the second argument. This could, however, lead to problems if a user wants to set the hostname to either `local` or `cluster` via jobmanager.sh.

Cheers,
Till

On Wed, Sep 26, 2018 at 11:24 AM Till Rohrmann <[hidden email]> wrote:
|
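The filtering alternative proposed above could look roughly like the sketch below. It is only an illustration of the idea, not code from Flink's scripts: `filter_legacy_mode` is a hypothetical function that drops `cluster` or `local` when it appears as the second argument and prints the remaining arguments.

```shell
# Hypothetical filter, NOT taken from Flink's startup scripts: drop a legacy
# execution mode value (`cluster` or `local`) found in the second position.
filter_legacy_mode() {
  action="${1:-}"
  second="${2:-}"
  if [ "$#" -ge 2 ]; then
    shift 2
    case "$second" in
      cluster|local)
        # Warn on stderr so the cleaned argument list on stdout stays usable.
        echo "warning: ignoring legacy execution mode '$second'" >&2
        set -- "$action" "$@"
        ;;
      *)
        set -- "$action" "$second" "$@"
        ;;
    esac
  fi
  echo "$*"
}

filter_legacy_mode start-foreground cluster        # legacy value is dropped
filter_legacy_mode start-foreground jm-host 8081   # passed through unchanged
```

The drawback mentioned in the post shows up directly here: a host that is really named `local` or `cluster` would be dropped as well, so such a filter trades one surprise for another.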