What's the meaning of "Registered `TaskManager` at akka://flink/deadLetters " ?

Reza Samee

I'm running a Flink cluster (a mini one with just one node), but the problem is that my TaskManager can't reach my JobManager!

Here are the logs from the TaskManager:
...
Trying to register at JobManager akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 20, timeout: 30 seconds)
Trying to register at JobManager akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 21, timeout: 30 seconds)
Trying to register at JobManager akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 22, timeout: 30 seconds)
Trying to register at JobManager akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 23, timeout: 30 seconds)
Trying to register at JobManager akka.tcp://flink@MY_PRIV_IP/user/jobmanager (attempt 24, timeout: 30 seconds)
...

My JobManager UI shows my TaskManager with this Path & ID: "akka://flink/deadLetters" (in the TaskManagers tab).
And I found these lines in my JobManager stdout:

Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-275619168] - leader session null
TaskManager ResourceID{resourceId='1132cbdaf2d8204e5e42e321e8592754'} has started.
Registered TaskManager at MY_PRIV_IP (akka://flink/deadLetters) as 7d9568445b4557a74d05a0771a08ad9c. Current number of registered hosts is 1. Current number of alive task slots is 20.


What's the meaning of these lines? Where should I look for the solution?




--
رضا سامعی / http://samee.blog.ir

Re: What's the meaning of "Registered `TaskManager` at akka://flink/deadLetters " ?

Piotr Nowojski
Hi,

Search both the job manager and task manager logs for the IP address(es) and port(s) that have timed out. First, make sure that the nodes are visible to each other using a simple ping. Afterwards, please check that the ports that timed out are open and not blocked by a firewall (for example with telnet).

You can search the documentation for the configuration parameters with “port” in their name.
Note, however, that many of them are random by default.
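
The telnet-style check described above can also be scripted. The following is only a minimal sketch, not part of Flink: the helper name `can_connect` is hypothetical, and the host and port have to be taken from your own logs or configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This only shows that something accepts connections on that port from
    the machine running the script; it says nothing about Flink itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False
```

Run it from the machine that needs to open the connection, for example `can_connect("172.16.20.18", 6123)` with whatever address and port appear in your own logs. A firewall may allow one direction and block the other, so the check has to be repeated from each side.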

Piotrek

(In reply to Reza Samee's message of 9 Jan 2018, 17:56.)


Re: What's the meaning of "Registered `TaskManager` at akka://flink/deadLetters " ?

Reza Samee
Thanks for the response, and sorry for the delay.

The ports logged by the JobManager and TaskManager are open!


Is this log OK?
2018-01-15 13:40:03,455 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.16.20.18:6123/user/jobmanager:null.

When I kill the TaskManager, the JobManager logs:
2018-01-15 13:32:41,419 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@stage_dbq_1:45532] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

But it does not decrement the number of available TaskManagers!
And when I start my single TaskManager again, it logs:

2018-01-15 13:32:52,753 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at ??? (akka://flink/deadLetters) as 626846ae27a833cb094eeeb047a6a72c. Current number of registered hosts is 2. Current number of alive task slots is 40.
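
The remote addresses and the suspicious registrations can be pulled out of log lines like the ones above mechanically. This is only an illustrative sketch: the function names are hypothetical, the regular expression is an assumption based on the address format seen in this thread, and the sample text is a shortened copy of the quoted log lines.

```python
import re

# Remote Akka addresses look like akka.tcp://flink@host:port in these logs.
AKKA_REMOTE = re.compile(r"akka\.tcp://[\w.-]+@([\w.-]+):(\d+)")


def remote_endpoints(log_text):
    """Return the set of (host, port) pairs found in akka.tcp addresses."""
    return {(host, int(port)) for host, port in AKKA_REMOTE.findall(log_text)}


def dead_letter_registrations(log_text):
    """Return the log lines that register a TaskManager at deadLetters."""
    return [
        line
        for line in log_text.splitlines()
        if "Registered TaskManager" in line and "akka://flink/deadLetters" in line
    ]


# Shortened copies of the log lines quoted in this thread.
SAMPLE_LOG = """\
New leader reachable under akka.tcp://flink@172.16.20.18:6123/user/jobmanager:null.
Association with remote system [akka.tcp://flink@stage_dbq_1:45532] has failed
Registered TaskManager at ??? (akka://flink/deadLetters) as 626846ae27a833cb094eeeb047a6a72c.
"""

print(sorted(remote_endpoints(SAMPLE_LOG)))
print(len(dead_letter_registrations(SAMPLE_LOG)))
```

The extracted host and port pairs are the ones worth testing for reachability from the other machine.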


(In reply to Piotr Nowojski's message of Wed, Jan 10, 2018, 11:36 AM.)

Re: What's the meaning of "Registered `TaskManager` at akka://flink/deadLetters " ?

Piotr Nowojski
Hi,

Could you post the full job manager and task manager logs, from startup until the first signs of the problem?

Thanks, Piotrek

(In reply to Reza Samee's message of 15 Jan 2018, 11:21.)


Re: What's the meaning of "Registered `TaskManager` at akka://flink/deadLetters " ?

Reza Samee
Hi,

I have attached the log files.

Thanks

(In reply to Piotr Nowojski's message of Mon, Jan 15, 2018, 3:36 PM.)

Attachments: flink-jobmanager.out (15K), flink-taskmanager.out (19K)

Re: What's the meaning of "Registered `TaskManager` at akka://flink/deadLetters " ?

Piotr Nowojski
Hi,

It seems that some of the ports have not been opened. As I pointed out in my first mail, please go through all of the config options regarding hostnames and ports (not only those that appear in the log files; something may not be logged):

jobmanager.rpc.port
taskmanager.rpc.port
taskmanager.data.port
blob.server.port 

And double check that they are accessible from the appropriate machines, ideally using an external tool such as telnet or ncat. Your network may be configured to accept some connections only from specific hosts (such as localhost). For example, in the case for which you attached the log files, did you check that the job manager host can open a connection to `stage_dbq_1:33633` (the task manager host and its RPC port; the RPC port is random by default)?

Also make sure that the configurations on the task manager and job manager are consistent.
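
One way to compare the port settings on both sides is to parse the two configuration files and list the four options named above next to each other. This is a rough sketch under several assumptions: the settings are taken to be flat `key: value` lines (as in a `flink-conf.yaml`), the helper names are hypothetical, and the sample values are made up for illustration.

```python
# The four options listed in this mail.
PORT_KEYS = [
    "jobmanager.rpc.port",
    "taskmanager.rpc.port",
    "taskmanager.data.port",
    "blob.server.port",
]


def parse_flat_conf(text):
    """Parse flat 'key: value' lines, ignoring comments and blank lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def port_report(jm_conf, tm_conf):
    """Return (key, jobmanager value, taskmanager value, same?) per option.

    A value of None means the option is not set in that file, so the
    default applies (which, per this thread, may be a random port).
    """
    rows = []
    for key in PORT_KEYS:
        jm_value, tm_value = jm_conf.get(key), tm_conf.get(key)
        rows.append((key, jm_value, tm_value, jm_value == tm_value))
    return rows


# Hypothetical file contents, for illustration only.
JM_TEXT = "jobmanager.rpc.port: 6123\n"
TM_TEXT = "jobmanager.rpc.port: 6123\ntaskmanager.rpc.port: 33633  # pinned\n"

for row in port_report(parse_flat_conf(JM_TEXT), parse_flat_conf(TM_TEXT)):
    print(row)
```

Options that are unset on both sides are the ones most likely to end up on a port that was never opened in the firewall, so pinning them to fixed values makes the reachability checks repeatable.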

Piotrek

(In reply to Reza Samee's message of 18 Jan 2018, 08:41.)