Unable to submit high parallelism job in cluster

Annemarie Burger
Hi,

I am running Flink on a cluster with 24 workers, each with 16 cores.
Starting the cluster works fine, and the web interface confirms that 384
slots are available. Executing my code with parallelism 24 works fine, but
when I try a higher parallelism, e.g. 384, the job submission never
completes. Submitting from the web interface does not start the job either,
and it reports no errors. I also tried starting four 1-slot TaskManagers on
each machine and executing with parallelism 96, with the same result. The
code is not very complicated; its logical graph has only 3 steps.
Attached is a file with the jstack output of the CliFrontend, which is using
CPU, and of the StandaloneSessionClusterEntrypoint, as well as the jstack
output of the TaskManagerRunner on a remote machine (cloud-12). All jstacks
are from this last scenario, submitted from the command line.
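
For readers who want the same kind of information as the jstack output from inside a running client, here is a minimal sketch using the JDK's `ThreadMXBean`. It is an illustration only, not what was used in this thread: the class name `BusyThreadFinder` is made up, and unlike jstack it can only inspect the JVM it runs in, so it would have to be called from the client program itself (for example from a watchdog thread).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reports the thread that has used the most CPU in this JVM.
final class BusyThreadFinder {

    /** Describes the live thread with the most accumulated CPU time, with up to maxFrames stack frames. */
    static String describeBusiestThread(int maxFrames) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return "thread CPU time is not available on this JVM";
        }
        long busiestId = -1;
        long busiestCpuNanos = -1;
        for (long id : threads.getAllThreadIds()) {
            long cpuNanos = threads.getThreadCpuTime(id); // -1 if the thread has died
            if (cpuNanos > busiestCpuNanos) {
                busiestCpuNanos = cpuNanos;
                busiestId = id;
            }
        }
        ThreadInfo info = busiestId < 0 ? null : threads.getThreadInfo(busiestId, maxFrames);
        if (info == null) {
            return "no live thread found";
        }
        StringBuilder out = new StringBuilder();
        out.append('"').append(info.getThreadName()).append("\" cpu=")
           .append(busiestCpuNanos / 1_000_000).append(" ms");
        for (StackTraceElement frame : info.getStackTrace()) {
            out.append(System.lineSeparator()).append("    at ").append(frame);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeBusiestThread(10));
    }
}
```

If the top frames of the busiest thread point into user code rather than into Flink, the time is probably being spent before the job is handed to the cluster.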
 
My relevant conf is as follows:

queryable-state.enable: true
jobmanager.rpc.address: cloud-11
jobmanager.rpc.port: 6123
taskmanager.heap.mb: 28672
jobmanager.heap.mb: 14240
taskmanager.memory.fraction: 0.7
taskmanager.network.numberOfBuffers: 16384
taskmanager.network.bufferSizeInBytes: 16384
taskmanager.memory.task.off-heap.size: 4000m
taskmanager.memory.managed.size: 10000m
#taskmanager.numberOfTaskSlots: 16 #for normal setup
taskmanager.numberOfTaskSlots: 1 #for when setting multiple taskmanagers per machine.

Am I doing something wrong?
Thanks in advance!

  jstack.jstack
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2502/jstack.jstack>  



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: Unable to submit high parallelism job in cluster

rmetzger0
Ah, the good old cloud-11 cluster at DIMA. I used that one as well in 2014 to test Flink there :)

Now regarding your question: Is it possible that "Experiments.Experiment1(Experiments.java:42)" depends on the parallelism, and it is doing a lot more work than expected because of that?
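
One way to check this hypothesis is to run the suspect step on its own for increasing parallelism and watch how the cost grows. The sketch below is only an illustration: the thread does not show what `Experiments.java:42` actually does, so `suspectStep` is a hypothetical stand-in with quadratic work, and the class name `ParallelismProbe` is made up.

```java
import java.util.function.IntConsumer;

// Hypothetical probe: measures how one setup step scales with the parallelism.
final class ParallelismProbe {

    /** Counts loop iterations so growth is visible even when timings are noisy. */
    static long operations = 0;

    /** Stand-in for the suspect line (an assumption): work grows with the square of the parallelism. */
    static void suspectStep(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            for (int j = 0; j < parallelism; j++) {
                operations++;
            }
        }
    }

    /** Runs one step for the given parallelism and returns the elapsed wall-clock time in nanoseconds. */
    static long timeNanos(IntConsumer step, int parallelism) {
        long start = System.nanoTime();
        step.accept(parallelism);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Doubling the parallelism each round makes the growth rate easy to read off.
        for (int p = 1; p <= 64; p *= 2) {
            operations = 0;
            long nanos = timeNanos(ParallelismProbe::suspectStep, p);
            System.out.printf("parallelism=%d operations=%d elapsed=%d ns%n", p, operations, nanos);
        }
    }
}
```

Replacing `suspectStep` with the real code and comparing consecutive rows shows whether the cost stays flat, grows polynomially, or explodes as the parallelism doubles.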

On Mon, Jul 27, 2020 at 9:50 PM Annemarie Burger <[hidden email]> wrote:

Re: Unable to submit high parallelism job in cluster

Arvid Heise-3
Hi Annemarie,

could you please share your topology? If you have a shuffle, your job needs 2 slots per parallelism. So you'd only be able to scale up to 384/2.
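
Taking this rule of thumb at face value, the slot arithmetic for the two setups described in the thread can be sketched as follows. The class and method names are made up for illustration, and whether a given job really needs two slots per parallel instance depends on its topology and slot-sharing settings, which the thread does not show.

```java
// Hypothetical helper: slot arithmetic for the cluster sizes mentioned in this thread.
final class SlotBudget {

    /** Total slots offered by the cluster. */
    static int totalSlots(int machines, int taskManagersPerMachine, int slotsPerTaskManager) {
        return machines * taskManagersPerMachine * slotsPerTaskManager;
    }

    /** Highest parallelism that fits if every parallel instance occupies the given number of slots. */
    static int maxParallelism(int totalSlots, int slotsPerParallelInstance) {
        return totalSlots / slotsPerParallelInstance;
    }

    public static void main(String[] args) {
        int normalSetup = totalSlots(24, 1, 16);  // one 16-slot TaskManager per machine
        int splitSetup = totalSlots(24, 4, 1);    // four 1-slot TaskManagers per machine
        System.out.println("normal setup: " + normalSetup + " slots, max parallelism at 2 slots each: "
                + maxParallelism(normalSetup, 2));
        System.out.println("split setup: " + splitSetup + " slots, max parallelism at 2 slots each: "
                + maxParallelism(splitSetup, 2));
    }
}
```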

On Tue, Jul 28, 2020 at 6:32 PM Robert Metzger <[hidden email]> wrote:

--

Arvid Heise | Senior Java Developer


Re: Unable to submit high parallelism job in cluster

Annemarie Burger
In reply to this post by rmetzger0
Hi!

The problem was indeed an exponentially slow subtask whose cost depended on
the parallelism. Everything is working now, thanks!
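
For later readers, a small sketch of why such a step can finish at parallelism 24 yet appear to hang at 96 or 384. It assumes "exponentially" is meant literally and that the work doubles with each unit of parallelism; both are assumptions, since the actual code is not shown in this thread, and the class name `ExponentialCost` is made up.

```java
import java.math.BigInteger;

// Hypothetical cost model: the number of steps doubles with every unit of parallelism.
final class ExponentialCost {

    /** Number of steps under the assumed 2^parallelism cost model. */
    static BigInteger steps(int parallelism) {
        return BigInteger.valueOf(2).pow(parallelism);
    }

    public static void main(String[] args) {
        for (int p : new int[] {24, 96, 384}) {
            BigInteger s = steps(p);
            System.out.printf("parallelism=%d -> %d decimal digits of work%n", p, s.toString().length());
        }
    }
}
```

Under this model, parallelism 24 costs 16,777,216 steps, which a single core handles almost instantly, while the step counts for 96 and 384 have 29 and 116 decimal digits and would never finish.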


