Hi Ovidiu,

Checking /var/log/messages, as Greg suggested, revealed that the TMs were killed due to an out-of-memory condition.

Here's the node architecture. Each node has 128 GB of RAM. I was trying to run 2 TMs per node, binding each to 12 cores (1 socket). The total number of nodes was 16.

I finally managed to get it working with the following (non-default) settings:

taskmanager.heap.mb: 12288
taskmanager.numberOfTaskSlots: 12
akka.ask.timeout: 1000 s
taskmanager.network.numberOfBuffers: 36864

Note the number-of-buffers value: it had to be higher (twice as high in this case) than what Flink suggests (#slots-per-TM^2 * #TMs * 4, which would be 12*12*32*4 = 18432). Otherwise, it would throw the not-enough-buffers error.

Thank you,
Saliya

On Tue, Jul 12, 2016 at 7:39 AM, Ovidiu-Cristian MARCU <[hidden email]> wrote:
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
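For readers following the thread, the buffer arithmetic in the message above can be reproduced with a small sketch. The function name `suggested_network_buffers` is hypothetical and only encodes the rule of thumb quoted in the message (#slots-per-TM^2 * #TMs * 4); the doubling is what worked for this particular job, not a general rule.

```python
def suggested_network_buffers(slots_per_tm: int, num_tms: int, factor: int = 4) -> int:
    """Rule of thumb quoted in the thread: #slots-per-TM^2 * #TMs * 4."""
    return slots_per_tm ** 2 * num_tms * factor


# Cluster layout described in the message.
nodes = 16
tms_per_node = 2
slots_per_tm = 12
num_tms = nodes * tms_per_node  # 32 TaskManagers in total

suggested = suggested_network_buffers(slots_per_tm, num_tms)

# In this thread the suggested value was not enough; twice the value worked.
configured = 2 * suggested

print(f"suggested:  {suggested}")
print(f"configured: {configured}")
```

With the numbers from the message this gives 18432 as the suggested value and 36864 as the value that was actually configured.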
Hi,
I would pay attention to the memory settings, so that heap + off-heap + network buffers for both TMs fit within your node's RAM. Also, there is some correlation between the number of buffers, the parallelism, and your workflow's operators. The suggested formula for numberOfBuffers does not work in every case. I guess numberOfBuffers could be determined automatically from the parallelism and the workflow's operators, but I'm not sure how to do that.

Best,
Ovidiu
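The memory check described above can be sketched as a rough budget calculation. This is only an illustration under stated assumptions: each network buffer is taken to be 32 KiB (believed to be Flink's default segment size, but check your version), off-heap memory is a placeholder set to 0, and JVM overhead is not counted. The helper names `tm_footprint_mb` and `fits_on_node` are hypothetical.

```python
# Assumption: one network buffer is 32 KiB; adjust if your segment size differs.
SEGMENT_SIZE_BYTES = 32 * 1024


def tm_footprint_mb(heap_mb, num_buffers, off_heap_mb=0,
                    segment_size_bytes=SEGMENT_SIZE_BYTES):
    """Rough per-TM footprint: heap + off-heap + network buffers, in MB."""
    network_mb = num_buffers * segment_size_bytes / (1024 * 1024)
    return heap_mb + off_heap_mb + network_mb


def fits_on_node(node_ram_mb, tms_per_node, per_tm_mb, reserve_mb=0):
    """True if all TMs on a node, plus an optional reserve, fit in its RAM."""
    return tms_per_node * per_tm_mb + reserve_mb <= node_ram_mb


# Settings from the thread: 12288 MB heap and 36864 buffers per TM,
# 2 TMs per node, 128 GB of RAM per node.
per_tm = tm_footprint_mb(heap_mb=12288, num_buffers=36864)
node_ram_mb = 128 * 1024

print(f"per-TM footprint: {per_tm} MB")
print(f"fits on node: {fits_on_node(node_ram_mb, 2, per_tm)}")
```

The `reserve_mb` parameter is there to leave headroom for the OS and other processes; how much to reserve depends on the node and is not something the thread specifies.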
Thank you, Ovidiu.

On Wed, Jul 13, 2016 at 3:34 PM, Ovidiu-Cristian MARCU <[hidden email]> wrote: