Hi,
I am running Flink on a cluster with 24 workers, each with 16 cores. Starting the cluster works fine, and the web interface confirms there are 384 slots working. Executing my code with parallelism 24 works fine, but when I try a higher parallelism, e.g. 384, the job never succeeds in submitting. Submitting from the web interface also does not start the job, nor does it give any errors. I also tried starting 4 1-slot taskmanagers on each machine and executing with parallelism 96, but I got the same problem. The code is not very complicated; the logical graph has only 3 steps.

Attached is a file with the jstacks of the CliFrontend that is using CPU and of the StandaloneSessionClusterEntrypoint, as well as the jstack of the TaskManagerRunner on a remote machine (cloud-12). The jstacks are all from this last scenario, when executing from the command line.

My relevant conf is as follows:

    queryable-state.enable: true
    jobmanager.rpc.address: cloud-11
    jobmanager.rpc.port: 6123
    taskmanager.heap.mb: 28672
    jobmanager.heap.mb: 14240
    taskmanager.memory.fraction: 0.7
    taskmanager.network.numberOfBuffers: 16384
    taskmanager.network.bufferSizeInBytes: 16384
    taskmanager.memory.task.off-heap.size: 4000m
    taskmanager.memory.managed.size: 10000m
    #taskmanager.numberOfTaskSlots: 16 #for normal setup
    taskmanager.numberOfTaskSlots: 1 #for when setting multiple taskmanagers per machine.

Am I doing something wrong? Thanks in advance!

jstack.jstack <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2502/jstack.jstack>

-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Ah, the good old cloud-11 cluster at DIMA. I used that one as well in 2014 to test Flink there :)

Now regarding your question: Is it possible that "Experiments.Experiment1(Experiments.java:42)" depends on the parallelism, and it is doing a lot more work than expected because of that?

On Mon, Jul 27, 2020 at 9:50 PM Annemarie Burger <[hidden email]> wrote:
> Hi,
Hi Annemarie,

could you please share your topology? If you have a shuffle, your job needs 2 slots per parallelism. So you'd only be able to scale up to 384/2.

On Tue, Jul 28, 2020 at 6:32 PM Robert Metzger <[hidden email]> wrote:
-- Arvid Heise | Senior Java Developer, Ververica GmbH
In reply to this post by rmetzger0
Hi!
The problem was indeed an exponentially slow subtask that was related to the parallelism. All working now, thanks!
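
For anyone hitting the same symptom (the job hangs at submission while the CliFrontend is busy on CPU), one way to check the diagnosis in this thread is to time the suspect step at several parallelism levels and look at how the cost grows. The following is only a minimal sketch, not the code from this thread: the class ParallelismProbe, its methods, and the quadraticSetup stand-in are all hypothetical names made up for illustration, and you would replace quadraticSetup with your own parallelism-dependent step.

```java
import java.util.function.IntConsumer;

class ParallelismProbe {

    /** Runs the step once for the given parallelism and returns the elapsed nanoseconds. */
    static long timeNanos(IntConsumer step, int parallelism) {
        long start = System.nanoTime();
        step.accept(parallelism);
        return System.nanoTime() - start;
    }

    /**
     * Returns the ratio of each cost to the previous one. If parallelism doubles
     * between measurements, a factor near 2 suggests linear growth, near 4
     * quadratic growth, and steadily increasing factors something worse.
     */
    static double[] growthFactors(long[] costs) {
        double[] factors = new double[costs.length - 1];
        for (int i = 1; i < costs.length; i++) {
            factors[i - 1] = (double) costs[i] / costs[i - 1];
        }
        return factors;
    }

    /** Hypothetical stand-in for a setup step whose work grows quadratically with parallelism. */
    static void quadraticSetup(int parallelism) {
        long sink = 0;
        for (int i = 0; i < parallelism; i++) {
            for (int j = 0; j < parallelism; j++) {
                sink += i ^ j;
            }
        }
        // Use the result so the loops are not trivially removed by the JIT.
        if (sink == -1) {
            System.out.println(sink);
        }
    }

    public static void main(String[] args) {
        // Parallelism levels from the original question, doubling each time.
        int[] levels = {24, 48, 96, 192, 384};
        long[] costs = new long[levels.length];
        for (int i = 0; i < levels.length; i++) {
            costs[i] = timeNanos(ParallelismProbe::quadraticSetup, levels[i]);
        }
        double[] factors = growthFactors(costs);
        for (int i = 0; i < factors.length; i++) {
            System.out.printf("parallelism %d -> %d: x%.1f%n",
                    levels[i], levels[i + 1], factors[i]);
        }
    }
}
```

Single timings of short steps are noisy, so treat the printed factors as a rough hint only. If the factors keep rising as the parallelism doubles, the step is probably the reason the submission appears to hang.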