Re: JM/TM startup time

Posted by Robert Schmidtke on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/JM-TM-startup-time-tp3009p3016.html

Looking into the logs of each TM it only took about 5 seconds per TM to go from "Trying to register" to "Successful registration".

On Fri, Oct 2, 2015 at 5:50 PM, Robert Schmidtke <[hidden email]> wrote:
I recently switched from running Flink on YARN to running Flink Standalone and I realized I had to add a sleep after ./start-cluster.sh (well, my Slurm adaptation of it). I did not have to explicitly wait before since Flink would wait until all YARN containers became available, so to be honest I don't know whether this is new or not. I just looked into an old log (well, from last Friday) and it took about 1 minute for 31 TMs to connect to 1 JM. They each had -Xms and -Xmx6079m though.

On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <[hidden email]> wrote:
Is that a new observation that it takes so long, or has it always taken so long?

On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <[hidden email]> wrote:
I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:
Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO  org.apache.flink.runtime.taskmanager.TaskManager
          - Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:
> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO  org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO  org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO  org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO  org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO  org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO  org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO  org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO  org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680



--
My GPG Key ID: 336E2680




--
My GPG Key ID: 336E2680



--
My GPG Key ID: 336E2680