(DEPRECATED) Apache Flink User Mailing List archive.

Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Jamie Grier-2

Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Anybody else seen this? I'm running both the JM and TM on the same host in this setup. This was working fine w/ Flink 1.5.3.

On the TaskManager:

00:31:30.268 INFO o.a.f.r.t.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@localhost:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@localhost:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..

On the JobManager:

00:32:00.339 ERROR a.r.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@localhost:6123/]] arriving at [akka.tcp://flink@localhost:6123] inbound addresses are [akka.tcp://flink@cluster:6123]
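The two addresses in the JobManager line can be compared mechanically. Below is a minimal Python sketch that extracts the recipient address and the inbound addresses from such an `EndpointWriter` line and reports whether they agree. The regex and the function name `parse_dropped_message` are assumptions derived from the single log line above; this is not a Flink or Akka utility.

```python
import re

# Matches Akka remote addresses such as akka.tcp://flink@localhost:6123
ADDR = re.compile(r"akka\.tcp://[^@\s]+@[^:\s]+:\d+")

def parse_dropped_message(line):
    """Return (recipient, inbound_addresses, matches) or None if the line does not fit."""
    head, sep, tail = line.partition("inbound addresses are")
    if not sep:
        return None
    # The recipient address is the first address after "non-local recipient".
    recipient = ADDR.search(head.partition("non-local recipient")[2])
    if recipient is None:
        return None
    inbound = [m.group(0) for m in ADDR.finditer(tail)]
    return recipient.group(0), inbound, recipient.group(0) in inbound
```

For the line above, the recipient host is `localhost` while the only inbound address uses the host `cluster`, so the check reports a mismatch.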

alex

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

We started to see the same errors after upgrading from Flink 1.4.2 to 1.6.0. We
have one JM and 5 TMs on Kubernetes. The JM is running in HA mode. The TaskManagers
sometimes lose their connection to the JM and log the same error that you
have.

2018-09-19 12:36:40,687 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..

Once a TM starts logging "Could not resolve ResourceManager", it does not
recover until I restart the TM pod.

Here is the content of our flink-conf.yaml:
blob.server.port: 6124
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 4096
jobmanager.web.history: 20
jobmanager.archive.fs.dir: s3://our_path
taskmanager.rpc.port: 6121
taskmanager.heap.mb: 16384
taskmanager.numberOfTaskSlots: 10
taskmanager.log.path: /opt/flink/log/output.log
web.log.path: /opt/flink/log/output.log
state.checkpoints.num-retained: 3
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter

high-availability: zookeeper
high-availability.jobmanager.port: 50002
high-availability.zookeeper.quorum: zookeeper_instance_list
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: profileservice
high-availability.storageDir: s3://our_path

Any help will be greatly appreciated!

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Jamie Grier-2

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Anybody else seen this and know the solution? We're dead in the water with Flink 1.5.4.

Jamie Grier-2

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Update on this:

The issue was the command being used to start the jobmanager: `jobmanager.sh start-foreground cluster`. This command was left over in our automation and used to be the correct way to start the JM. In Flink 1.5.4, however, the second parameter, `cluster`, is interpreted as the hostname for the jobmanager to bind to, which matches the `akka.tcp://flink@cluster:6123` inbound address in the JobManager log above.

The solution was just to remove `cluster` from that command.
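For automation that may still carry the old invocation, a rough Python check like the following could flag it. This is a sketch under assumptions: the name `find_legacy_mode_args` and the regex are made up here, and it only looks for `cluster` or `local` directly after the action of a `jobmanager.sh` call, as in the command above.

```python
import re

# jobmanager.sh <action> <cluster|local>: the second token is the old
# execution mode, which the script now reads as a hostname.
LEGACY_CALL = re.compile(
    r"jobmanager\.sh\s+(start-foreground|start)\s+(cluster|local)(?=\s|$)"
)

def find_legacy_mode_args(script_text):
    """Return (line_number, line) pairs for calls that still pass an execution mode."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if LEGACY_CALL.search(line):
            hits.append((number, line.strip()))
    return hits
```

Real hostnames such as `localhost` or `cluster-1` are not flagged, because the pattern requires `cluster` or `local` to be a complete token.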

Till Rohrmann

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Hi Jamie,

thanks for the update on how to fix the problem. This is very helpful for the rest of the community.

The change that removed the execution mode parameter (FLINK-8696) from the startup scripts was actually released with Flink 1.5.0. As a result, the hostname became the second parameter. When the startup scripts are called with the old syntax, the execution mode argument is interpreted as the hostname. This hostname option was, however, not properly evaluated until we fixed it in Flink 1.5.4. Therefore, the problem only surfaced now.
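To make that sequence concrete, here is a toy Python model of what happens to the second positional argument in each of the three phases described above. It is not the actual script logic; the function name, the phase labels and the returned dictionaries are assumptions made for illustration, and "not properly evaluated" is simplified to "has no effect".

```python
def interpret_second_arg(phase, args):
    """Toy model of `jobmanager.sh <action> <second arg>`.

    phase is one of:
      "mode_param"        before Flink 1.5.0: second argument is the execution mode
      "host_unevaluated"  from 1.5.0 until the fix: read as hostname, without effect
      "host_evaluated"    with the fix in 1.5.4: read as hostname and applied
    """
    second = args[1] if len(args) > 1 else None
    if phase == "mode_param":
        return {"mode": second, "host": None}
    if phase == "host_unevaluated":
        return {"mode": None, "host": None}
    if phase == "host_evaluated":
        return {"mode": None, "host": second}
    raise ValueError(f"unknown phase: {phase}")
```

With the old call `["start-foreground", "cluster"]`, only the last phase yields `host == "cluster"`, which is the point at which the old syntax starts to break.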

We definitely need to treat the startup scripts as a stable API as well. So far, we don't have good tooling to ensure that we don't introduce breaking changes. In the future we need to be more careful!

Cheers,

Till

rmetzger0

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Hey Jamie,

We've been facing the same issue with dA Platform when running Flink 1.6.1.

I assume a lot of people will be affected by this.

Fabian Hueske-2

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Should we add a warning to the release announcements?

Fabian

Till Rohrmann

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

Yes, that would be a good idea. I think it should go into the release notes. Will add it.

Till Rohrmann

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

What do you think about reverting this change (FLINK-8696), since the failure is really hard for users to debug? One problem would be that people may by now rely on the second argument being the hostname.

An alternative could be to filter out `cluster` and `local` if they appear as the second argument. This could, however, lead to problems if a user wants to set the hostname to either `local` or `cluster` via jobmanager.sh.
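A rough Python sketch of that filtering idea, mainly to show where the ambiguity sits. The function name `normalize_args` and its return shape are hypothetical; an actual change would live in the shell startup scripts.

```python
LEGACY_MODES = {"cluster", "local"}

def normalize_args(args):
    """Drop a legacy execution mode found in second position.

    Returns (cleaned_args, dropped), where dropped is the removed token or None.
    """
    if len(args) > 1 and args[1] in LEGACY_MODES:
        return [args[0]] + list(args[2:]), args[1]
    return list(args), None
```

A host that really is named `cluster` or `local` cannot be told apart from the old mode argument here, so it would be dropped as well. That is the drawback described above.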

Cheers,

Till
