How to increase akka heartbeat?

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

How to increase akka heartbeat?

Saiph Kappa
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.

Reply | Threaded
Open this post in threaded view
|

Re: How to increase akka heartbeat?

rmetzger0
Hi Saiph,

are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html and search for "heartbeat".

On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <[hidden email]> wrote:
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.


Reply | Threaded
Open this post in threaded view
|

Re: How to increase akka heartbeat?

Saiph Kappa
I am not sure.

For that particular machine I get messages like these:
«
myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from Actor[akka://flink/deadLetters].
^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Connected to new JobManager akka.tcp://flink@myip:6123/user/jobmanager.

^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Sending message to JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job JOB1 (5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].

^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Received message JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
»

I tried to set the heartbeat interval in the cluster but it didn't solve the problem, should I try to set it in the client (how can I do it)? I see no other errors or exceptions on the log files.




On Fri, Feb 19, 2016 at 7:07 PM, Robert Metzger <[hidden email]> wrote:
Hi Saiph,

are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html and search for "heartbeat".

On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <[hidden email]> wrote:
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.



Reply | Threaded
Open this post in threaded view
|

Re: How to increase akka heartbeat?

Saiph Kappa
These were the parameters that I set btw:

akka.watch.heartbeat.interval: 100
akka.transport.heartbeat.interval: 1000

On Fri, Feb 19, 2016 at 7:43 PM, Saiph Kappa <[hidden email]> wrote:
I am not sure.

For that particular machine I get messages like these:
«
myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from Actor[akka://flink/deadLetters].
^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Connected to new JobManager akka.tcp://flink@myip:6123/user/jobmanager.

^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Sending message to JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job JOB1 (5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].

^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Received message JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
»

I tried to set the heartbeat interval in the cluster but it didn't solve the problem, should I try to set it in the client (how can I do it)? I see no other errors or exceptions on the log files.




On Fri, Feb 19, 2016 at 7:07 PM, Robert Metzger <[hidden email]> wrote:
Hi Saiph,

are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html and search for "heartbeat".

On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <[hidden email]> wrote:
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.




Reply | Threaded
Open this post in threaded view
|

Re: How to increase akka heartbeat?

rmetzger0
Hi,
can you maybe (if you want also private) send me the full logs of the jobmanager? The messages you've posted here are logged at DEBUG level. They don't indicate an erroneous behavior of the system.

On Fri, Feb 19, 2016 at 8:44 PM, Saiph Kappa <[hidden email]> wrote:
These were the parameters that I set btw:

akka.watch.heartbeat.interval: 100
akka.transport.heartbeat.interval: 1000

On Fri, Feb 19, 2016 at 7:43 PM, Saiph Kappa <[hidden email]> wrote:
I am not sure.

For that particular machine I get messages like these:
«
myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from Actor[akka://flink/deadLetters].
^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Connected to new JobManager akka.tcp://flink@myip:6123/user/jobmanager.

^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Sending message to JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job JOB1 (5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].

^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Received message JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
»

I tried to set the heartbeat interval in the cluster but it didn't solve the problem, should I try to set it in the client (how can I do it)? I see no other errors or exceptions on the log files.




On Fri, Feb 19, 2016 at 7:07 PM, Robert Metzger <[hidden email]> wrote:
Hi Saiph,

are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html and search for "heartbeat".

On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <[hidden email]> wrote:
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.





Reply | Threaded
Open this post in threaded view
|

Re: How to increase akka heartbeat?

Stephan Ewen
Hi Saiph!

What is the problem that is happening? The log actually looks like the Job is successfully sent to the JobManager.

Stephan



On Fri, Feb 19, 2016 at 8:49 PM, Robert Metzger <[hidden email]> wrote:
Hi,
can you maybe (if you want also private) send me the full logs of the jobmanager? The messages you've posted here are logged at DEBUG level. They don't indicate an erroneous behavior of the system.

On Fri, Feb 19, 2016 at 8:44 PM, Saiph Kappa <[hidden email]> wrote:
These were the parameters that I set btw:

akka.watch.heartbeat.interval: 100
akka.transport.heartbeat.interval: 1000

On Fri, Feb 19, 2016 at 7:43 PM, Saiph Kappa <[hidden email]> wrote:
I am not sure.

For that particular machine I get messages like these:
«
myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from Actor[akka://flink/deadLetters].
^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Connected to new JobManager akka.tcp://flink@myip:6123/user/jobmanager.

^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Sending message to JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job JOB1 (5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].

^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Received message JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
»

I tried to set the heartbeat interval in the cluster but it didn't solve the problem, should I try to set it in the client (how can I do it)? I see no other errors or exceptions on the log files.




On Fri, Feb 19, 2016 at 7:07 PM, Robert Metzger <[hidden email]> wrote:
Hi Saiph,

are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html and search for "heartbeat".

On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <[hidden email]> wrote:
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.






Reply | Threaded
Open this post in threaded view
|

Re: How to increase akka heartbeat?

Saiph Kappa
Thanks for your help. Apparently the problem was not in Akka. It seems that when using a source .socketTextStream with maxRetry = -1, it continually attempts to connect to the socket for the 1st time, but once it is connected, and if no data is sent, it seems that the job is terminated.

On Fri, Feb 19, 2016 at 9:13 PM, Stephan Ewen <[hidden email]> wrote:
Hi Saiph!

What is the problem that is happening? The log actually looks like the Job is successfully sent to the JobManager.

Stephan



On Fri, Feb 19, 2016 at 8:49 PM, Robert Metzger <[hidden email]> wrote:
Hi,
can you maybe (if you want also private) send me the full logs of the jobmanager? The messages you've posted here are logged at DEBUG level. They don't indicate an erroneous behavior of the system.

On Fri, Feb 19, 2016 at 8:44 PM, Saiph Kappa <[hidden email]> wrote:
These were the parameters that I set btw:

akka.watch.heartbeat.interval: 100
akka.transport.heartbeat.interval: 1000

On Fri, Feb 19, 2016 at 7:43 PM, Saiph Kappa <[hidden email]> wrote:
I am not sure.

For that particular machine I get messages like these:
«
myip:6123/user/jobmanager#291801197])) at akka://flink/user/$a from Actor[akka://flink/deadLetters].
^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Connected to new JobManager akka.tcp://flink@myip:6123/user/jobmanager.

^[[34m[INFO]^[[0;39m o.a.f.r.c.JobClientActor    - Sending message to JobManager akka.tcp://flink@myip:6123/user/jobmanager to submit job JOB1 (5f9cef0c2e4b69530bf1e2485e94d326) and wait for progress


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].


^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Handled message LeaderSessionMessage(null,JobManagerActorRef(Actor[akka.tcp://flink@myip:6123/user/jobmanager#291801197])) in 48 ms from Actor[akka://flink/deadLetters].

^[[39m[DEBUG]^[[0;39m o.a.f.r.c.JobClientActor    - Received message JobSubmitSuccess(2575d5ff5c10336beb7820a052a63623) at akka://flink/user/$a from Actor[akka.tcp://flink@myip:6123/user/jobmanager#1144818256].
»

I tried to set the heartbeat interval in the cluster but it didn't solve the problem, should I try to set it in the client (how can I do it)? I see no other errors or exceptions on the log files.




On Fri, Feb 19, 2016 at 7:07 PM, Robert Metzger <[hidden email]> wrote:
Hi Saiph,

are you sure that the jobs are cancelled because the client disconnects?

For the different timeouts, check the configuration page: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html and search for "heartbeat".

On Fri, Feb 19, 2016 at 8:04 PM, Saiph Kappa <[hidden email]> wrote:
Hi,

I have a Flink client application that launches jobs to remote clusters. However I'm getting my jobs cancelled:
"18:25:29,650 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:52929] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]."

How can I increase the akka heartbeat interval? Where should I set that configuration parameter, in the client or in the Flink clusters, and in which file.

Thanks.