Job Manager Configuration

Job Manager Configuration

Chan, Regina

Flink Users,

 

I have about 300 parallel flows in one job, each with 2 inputs, 3 operators, and 1 sink, which makes for a large job. I keep getting the timeout exception below, even though I’ve already set the timeout to 30 minutes and given the JobManager a 6 GB heap. Is there a heuristic for better configuring the JobManager?

 

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.

 

Regina Chan

Goldman Sachs Enterprise Platforms, Data Architecture

30 Hudson Street, 37th floor | Jersey City, NY 07302 | (212) 902-5697

 


RE: Job Manager Configuration

Chan, Regina

An additional question: what is the largest plan that the JobManager can handle? Is there a limit? My flows don’t need to run in parallel and can run independently. I wanted them to run in one single job because it’s part of one logical commit on my side.

 

Thanks,

Regina

 


Re: Job Manager Configuration

Chesnay Schepler
AFAIK there is no theoretical limit on the size of the plan; it just depends on the available resources.

The job submission times out because it takes too long to deploy all the operators that the job defines. With 300 flows, each with 6 operators, you're looking at potentially (1800 * parallelism) tasks that have to be deployed. For each task Flink copies the user code of all flows to the executing TaskManager, which the network may just not be able to handle in time.

I suggest splitting your job into smaller batches or even running each of them independently.
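
As a rough illustration of the estimate above, the count can be written out directly. This is only a back-of-the-envelope sketch: the function name is hypothetical, the parallelism values are assumed for illustration, and every operator is counted as its own task, so the result is a potential upper bound rather than what Flink will necessarily deploy.

```python
def potential_tasks(flows, operators_per_flow, parallelism):
    """Upper-bound estimate: one task per operator per parallel instance."""
    return flows * operators_per_flow * parallelism


FLOWS = 300
OPERATORS_PER_FLOW = 6  # 2 inputs + 3 operators + 1 sink, as described in the thread

# The parallelism values below are assumptions chosen only for illustration.
for parallelism in (1, 2, 20):
    print(parallelism, potential_tasks(FLOWS, OPERATORS_PER_FLOW, parallelism))
```

Even at a parallelism of 1 this gives 1800 tasks, which is why splitting the job reduces the deployment work per submission.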


RE: Job Manager Configuration

Newport, Billy

The user code for all the flows is common, though, so is there an inefficiency here in copying this code for every operator?

 

 


RE: Job Manager Configuration

Chan, Regina
In reply to this post by Chesnay Schepler

Does it copy per TaskManager or per operator? I only gave it 10 TaskManagers with 2 slots. I’m perfectly fine with it queuing up and running when it has the resources to.

 

 

 


Re: Job Manager Configuration

Till Rohrmann-2
Hi Regina,

The user code is uploaded once to the `JobManager` and then downloaded once by each `TaskManager`, when it first receives the command to execute the first task of your job.

As Chesnay said, there is no fundamental limitation on the size of a Flink job. However, it might be the case that you have configured your job sub-optimally. You said that you have 300 parallel flows. Depending on whether or not you've defined separate slot sharing groups for them, parallel subtasks of all 300 flows might share the same slot (which is what happens if you haven't changed the slot sharing group). Depending on what you calculate, this can be inefficient because the individual tasks don't get much computation time. Moreover, all tasks allocate some objects on the heap, which can lead to more GC. Therefore, it might make sense to group some of the flows together and run them in batches, starting each batch after the previous one has completed. But this is hard to say without knowing the details of your job and getting a glimpse of the JobManager logs.
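
To make the slot sharing trade-off concrete, here is a small sketch. It assumes the 10 TaskManagers with 2 slots each mentioned earlier in the thread, 300 flows of 6 operators each, no operator chaining, and a uniform parallelism of 2 (an assumed value). The function names are made up for illustration and are not Flink API. The sketch relies on the rule that a slot sharing group needs as many slots as the highest parallelism used inside it.

```python
def slots_required(num_groups, parallelism):
    """Slots needed when every slot sharing group runs at the same parallelism."""
    return num_groups * parallelism


def max_subtasks_per_slot(vertices_in_group):
    """A shared slot can hold up to one subtask of each vertex in its group."""
    return vertices_in_group


TOTAL_SLOTS = 10 * 2      # 10 TaskManagers, assumed 2 slots each
FLOWS = 300
VERTICES_PER_FLOW = 6     # 2 inputs + 3 operators + 1 sink
PARALLELISM = 2           # assumption, not stated in the thread

# Everything in the single default group: few slots, very crowded slots.
print(slots_required(1, PARALLELISM),
      max_subtasks_per_slot(FLOWS * VERTICES_PER_FLOW))

# One group per flow: lightly loaded slots, but far more slots than exist.
print(slots_required(FLOWS, PARALLELISM),
      max_subtasks_per_slot(VERTICES_PER_FLOW))
```

Under these assumptions the default group packs up to 1800 subtasks into a slot, while one group per flow would ask for 600 slots against 20 available. Running the flows in batches sits between the two extremes.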

Concerning the exception you're seeing, it would also be helpful to see the logs of the client and the JobManager. Actually, the scheduling of the job is independent of the response. Only the creation of the ExecutionGraph and, in an HA setup, making the JobGraph highly available are executed before the JobManager acknowledges the job submission. The SubmissionTimeoutException is thrown only if this acknowledgement message is not received in time on the client side. Therefore, I assume that the JobManager is somehow too busy or is kept from sending the acknowledgement message.
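
The sequence described above can be summarised as a toy model. This is a simplified sketch and not Flink code: the function and parameter names are hypothetical and the durations are invented. Its only point is that the time spent before the acknowledgement counts against the client timeout, while the scheduling time does not.

```python
def submission_times_out(graph_build_s, ha_persist_s, client_timeout_s,
                         scheduling_s=0.0):
    """Return True if the client would give up before the acknowledgement.

    scheduling_s is accepted only to show that it plays no role: in this
    model, scheduling happens after the acknowledgement has been sent.
    """
    time_before_ack_s = graph_build_s + ha_persist_s
    return time_before_ack_s > client_timeout_s


# Invented numbers: a 30-minute client timeout (1800 s), as in the thread.
print(submission_times_out(graph_build_s=120, ha_persist_s=10,
                           client_timeout_s=1800, scheduling_s=5000))
print(submission_times_out(graph_build_s=2000, ha_persist_s=10,
                           client_timeout_s=1800))
```

In this model a timeout despite a generous limit points at the pre-acknowledgement work, or at a JobManager too busy to reply, which is what the logs should help confirm.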

Cheers,
Till 




Re: Job Manager Configuration

Till Rohrmann
Quick question Regina: Which version of Flink are you running?

Cheers,
Till


RE: Job Manager Configuration

Chan, Regina

Thanks for the responses!

 

I’m currently using 1.2.0 and am going to bump it up once I have things stabilized. I haven’t defined any slot sharing groups, but I do think that I’ve probably got my job configured suboptimally. I’ve refactored my code so that I can submit subsets of the flows at a time, and it seems to work. The threshold between the JobManager being able to acknowledge the job and not being able to seems to hover somewhere between 10 and 20 flows.
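
Submitting subsets of the flows could be sketched roughly as follows. This is not the actual refactored code: the helper name and the flow identifiers are hypothetical, and the batch size of 10 is an assumption taken from the lower end of the 10 to 20 range mentioned above.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical flow identifiers standing in for the 300 real flows.
flows = [f"flow-{i}" for i in range(300)]

# Assumed batch size; each group would be submitted as its own job.
groups = list(batches(flows, 10))
print(len(groups), groups[0][0], groups[-1][-1])
```

Each group would then be submitted as a separate job. Whether the groups still form one logical commit has to be handled outside Flink.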

 

I guess what doesn’t make too much sense to me is if the user code is uploaded once to the JobManager and downloaded from each TaskManager, what exactly is the JobManager doing that’s keeping it busy? It’s the same code across the TaskManagers.

 

I’ll get you the logs shortly.

 


Re: Job Manager Configuration

Till Rohrmann
That is the question I hope to be able to answer with the logs. Let's see what they say.

Cheers,
Till


Re: Job Manager Configuration

Joshua Griffith
In reply to this post by Chan, Regina
I have an IO-dominated batch job with 471 distinct tasks (3786 tasks with parallelism) running on 8 nodes with 12 GiB of memory and 4 CPUs each. I haven’t had any problems adding additional tasks, except that 1) tasks time out the first time the cluster is started (I suppose the JVM needs to warm up), and 2) the UI can’t really handle this many tasks, although using Firefox Quantum makes it possible to see what’s going on.

Joshua


RE: Job Manager Configuration

Chan, Regina

Is your job running on a standalone cluster? I’m using a detached YARN session in a multi-tenant environment.

And I’m guessing you haven’t had to do anything special for the Akka configuration.

 

 


Re: Job Manager Configuration

Joshua Griffith
We run on a dedicated cluster managed by Kubernetes. The task managers run as a DaemonSet and the job manager runs as a Deployment. We had to increase the Akka frame size and client timeout on the service that submits jobs but we haven’t altered any Akka settings in the cluster. Here’s the container we run: https://github.com/orgsync/docker-flink
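
For reference, the two client-side settings mentioned here correspond to the Flink configuration keys `akka.client.timeout` and `akka.framesize`. The sketch below only renders them as configuration lines. The helper name is hypothetical and the values are assumptions: 1800 s matches the 30-minute timeout from the original post, and 20 MiB is an arbitrary example frame size. Key names and value formats should be checked against the documentation for the Flink version in use.

```python
def flink_conf_lines(client_timeout_s, framesize_bytes):
    """Render the two client-side Akka settings as flink-conf.yaml style lines."""
    return [
        f"akka.client.timeout: {client_timeout_s} s",
        f"akka.framesize: {framesize_bytes}b",
    ]


# Assumed values: 30 minutes and 20 MiB, chosen only for illustration.
lines = flink_conf_lines(30 * 60, 20 * 1024 * 1024)
print("\n".join(lines))
```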
