(DEPRECATED) Apache Flink User Mailing List archive.

JM/TM startup time

Classic

List

Threaded

10 messages Options

Robert Schmidtke

JM/TM startup time

Hi everyone,

I'm wondering about the startup times of the TMs:

...

17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor

17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]

17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds

17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)

17:03:33,295 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).

17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (34395 MB).

// almost 40 seconds //

17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.

17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054

17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.

17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)

17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 16 task slot(s).

17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB (used/committed/max)]

17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)

17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network stack and library cache.

...

The same goes for the JM (obviously).

...

17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend

17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Setting up web info server, using web-root directory jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.

17:03:31,753 INFO org.eclipse.jetty.util.log - jetty-0.10-SNAPSHOT

17:03:31,806 INFO org.eclipse.jetty.util.log - Started SelectChannelConnector@0.0.0.0:8081

17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Started web info server for JobManager on 0.0.0.0:8081

// almost 35 seconds //

17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at cumu02-02 (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1. Current number of alive task slots is 16.

...

Is this to be expected? Any ideas what's happening in the meantime? I'm asking because I'm running into errors when submitting my job too early (and not enough TMs have connected).

Cheers

Robert

My GPG Key ID: 336E2680

Stephan Ewen

Re: JM/TM startup time

The delay you see happens when the TaskManager allocates the memory for its memory manager. Allocating that much in a JVM can take a bit, although 40 seconds looks a lot to me...

How do you start the JVM? Are Xmx and Xms set to the same value? If not, the JVM incrementally grows through multiple garbage collections, which makes it quite slow.

If the JVM starts with a large heap, it should actually not take as long as in your case...

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

Hi everyone,

I'm wondering about the startup times of the TMs:

...
17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
17:03:33,295 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (34395 MB).

// almost 40 seconds //

17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 16 task slot(s).
17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB (used/committed/max)]
17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network stack and library cache.
...

The same goes for the JM (obviously).

...
17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend
17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Setting up web info server, using web-root directory jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
17:03:31,753 INFO org.eclipse.jetty.util.log - jetty-0.10-SNAPSHOT
17:03:31,806 INFO org.eclipse.jetty.util.log - Started SelectChannelConnector@0.0.0.0:8081
17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Started web info server for JobManager on 0.0.0.0:8081

// almost 35 seconds //

17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at cumu02-02 (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1. Current number of alive task slots is 16.
...

Is this to be expected? Any ideas what's happening in the meantime? I'm asking because I'm running into errors when submitting my job too early (and not enough TMs have connected).

Cheers
Robert

--
My GPG Key ID: 336E2680

Maximilian Michels

Re: JM/TM startup time

In reply to this post by Robert Schmidtke

Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

Robert Schmidtke

Re: JM/TM startup time

In reply to this post by Stephan Ewen

Yes, they're both set to the same value. These are the JVM Options as reported by the TM:

17:36:12,137 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:

17:36:12,137 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms51355M

17:36:12,137 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx51355M

17:36:12,137 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=51355M

Same goes for JM:

7:36:10,837 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:

17:36:10,837 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms25677m

17:36:10,837 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx25677m

On Fri, Oct 2, 2015 at 5:35 PM, Stephan Ewen <[hidden email]> wrote:

The delay you see happens when the TaskManager allocates the memory for its memory manager. Allocating that much in a JVM can take a bit, although 40 seconds looks a lot to me...

How do you start the JVM? Are Xmx and Xms set to the same value? If not, the JVM incrementally grows through multiple garbage collections, which makes it quite slow.

If the JVM starts with a large heap, it should actually not take as long as in your case...

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:
Hi everyone,

I'm wondering about the startup times of the TMs:

...
17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
17:03:33,295 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (34395 MB).

// almost 40 seconds //

17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 16 task slot(s).
17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB (used/committed/max)]
17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network stack and library cache.
...

The same goes for the JM (obviously).

...
17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend
17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Setting up web info server, using web-root directory jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
17:03:31,753 INFO org.eclipse.jetty.util.log - jetty-0.10-SNAPSHOT
17:03:31,806 INFO org.eclipse.jetty.util.log - Started SelectChannelConnector@0.0.0.0:8081
17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer - Started web info server for JobManager on 0.0.0.0:8081

// almost 35 seconds //

17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at cumu02-02 (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1. Current number of alive task slots is 16.
...

Is this to be expected? Any ideas what's happening in the meantime? I'm asking because I'm running into errors when submitting my job too early (and not enough TMs have connected).

Cheers
Robert

--
My GPG Key ID: 336E2680

My GPG Key ID: 336E2680

Robert Schmidtke

Re: JM/TM startup time

In reply to this post by Maximilian Michels

I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:

Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

My GPG Key ID: 336E2680

Stephan Ewen

Re: JM/TM startup time

Is that a new observation that it takes so long, or has it always taken so long?

On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <[hidden email]> wrote:

I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:
Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

Robert Schmidtke

Re: JM/TM startup time

I recently switched from running Flink on YARN to running Flink Standalone and I realized I had to add a sleep after ./start-cluster.sh (well, my Slurm adaptation of it). I did not have to explicitly wait before since Flink would wait until all YARN containers became available, so to be honest I don't know whether this is new or not. I just looked into an old log (well, from last Friday) and it took about 1 minute for 31 TMs to connect to 1 JM. They each had -Xms and -Xmx6079m though.

On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <[hidden email]> wrote:

Is that a new observation that it takes so long, or has it always taken so long?

On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <[hidden email]> wrote:
I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:
Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

My GPG Key ID: 336E2680

Robert Schmidtke

Re: JM/TM startup time

Looking into the logs of each TM it only took about 5 seconds per TM to go from "Trying to register" to "Successful registration".

On Fri, Oct 2, 2015 at 5:50 PM, Robert Schmidtke <[hidden email]> wrote:

I recently switched from running Flink on YARN to running Flink Standalone and I realized I had to add a sleep after ./start-cluster.sh (well, my Slurm adaptation of it). I did not have to explicitly wait before since Flink would wait until all YARN containers became available, so to be honest I don't know whether this is new or not. I just looked into an old log (well, from last Friday) and it took about 1 minute for 31 TMs to connect to 1 JM. They each had -Xms and -Xmx6079m though.

On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <[hidden email]> wrote:
Is that a new observation that it takes so long, or has it always taken so long?

On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <[hidden email]> wrote:
I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:
Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

My GPG Key ID: 336E2680

Stephan Ewen

Re: JM/TM startup time

Yeah, registration is fast, JVM heatup is what takes time.

You can try two things:

- Use the off-heap memory variant and see if that allocates the memory faster. Just add the entry "taskmanager.memory.off-heap: true" to the config.

- Or start the system in "streaming" mode. Then, it will not pre-allocate the memory, but allocate it as needed. That may, however, cause long GC pauses when the jobs start requesting the memory, so I am not sure if that is better.

Greetings,

Stephan

On Fri, Oct 2, 2015 at 5:53 PM, Robert Schmidtke <[hidden email]> wrote:

Looking into the logs of each TM it only took about 5 seconds per TM to go from "Trying to register" to "Successful registration".

On Fri, Oct 2, 2015 at 5:50 PM, Robert Schmidtke <[hidden email]> wrote:
I recently switched from running Flink on YARN to running Flink Standalone and I realized I had to add a sleep after ./start-cluster.sh (well, my Slurm adaptation of it). I did not have to explicitly wait before since Flink would wait until all YARN containers became available, so to be honest I don't know whether this is new or not. I just looked into an old log (well, from last Friday) and it took about 1 minute for 31 TMs to connect to 1 JM. They each had -Xms and -Xmx6079m though.

On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <[hidden email]> wrote:
Is that a new observation that it takes so long, or has it always taken so long?

On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <[hidden email]> wrote:
I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:
Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

Robert Schmidtke

Re: JM/TM startup time

Thanks, the off-heap solution is indeed faster. 15s instead of 45s for the amounts of memory I allocate.

On Fri, Oct 2, 2015 at 6:09 PM, Stephan Ewen <[hidden email]> wrote:

Yeah, registration is fast, JVM heatup is what takes time.

You can try two things:

- Use the off-heap memory variant and see if that allocates the memory faster. Just add the entry "taskmanager.memory.off-heap: true" to the config.

- Or start the system in "streaming" mode. Then, it will not pre-allocate the memory, but allocate it as needed. That may, however, cause long GC pauses when the jobs start requesting the memory, so I am not sure if that is better.

Greetings,
Stephan

On Fri, Oct 2, 2015 at 5:53 PM, Robert Schmidtke <[hidden email]> wrote:
Looking into the logs of each TM it only took about 5 seconds per TM to go from "Trying to register" to "Successful registration".

On Fri, Oct 2, 2015 at 5:50 PM, Robert Schmidtke <[hidden email]> wrote:
I recently switched from running Flink on YARN to running Flink Standalone and I realized I had to add a sleep after ./start-cluster.sh (well, my Slurm adaptation of it). I did not have to explicitly wait before since Flink would wait until all YARN containers became available, so to be honest I don't know whether this is new or not. I just looked into an old log (well, from last Friday) and it took about 1 minute for 31 TMs to connect to 1 JM. They each had -Xms and -Xmx6079m though.

On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen <[hidden email]> wrote:
Is that a new observation that it takes so long, or has it always taken so long?

On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke <[hidden email]> wrote:
I figured the JM would be waiting for the TMs. Each of my nodes has 64G of memory available.

On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels <[hidden email]> wrote:
Hi Robert,

During startup, the task manager allocates the entire managed memory.

From the log:
17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.7 of the currently free heap space for Flink
managed heap memory (34395 MB).

It seems like you are allocating almost 35 GB of memory which might
take a bit (40 seconds still seems like too much time). What
configuration did you use for the task managers? Do you really have
that much memory or is your system swapping?

I think the JobManager just appears to take a long time because the
TaskManagers register late.

Cheers,
Max

On Fri, Oct 2, 2015 at 5:26 PM, Robert Schmidtke <[hidden email]> wrote:

> Hi everyone,
>
> I'm wondering about the startup times of the TMs:
>
> ...
> 17:03:33,255 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor
> 17:03:33,262 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
> - NettyConfig [server address: cumu02-05/130.73.144.64, server port: 45731,
> memory segment size (bytes): 32768, transport type: NIO, number of server
> threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
> default), server connect backlog: 0 (use Netty's default), client connect
> timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's
> default)]
> 17:03:33,266 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Messages between TaskManager and JobManager have a max timeout of 100000
> milliseconds
> 17:03:33,268 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Temporary file directory '/tmp': total 44 GB, usable 37 GB (84.09% usable)
> 17:03:33,295 INFO
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64
> MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 17:03:33,554 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Using 0.7 of the currently free heap space for Flink managed heap memory
> (34395 MB).
>
> // almost 40 seconds //
>
> 17:04:12,445 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
> - I/O manager uses directory
> /tmp/flink-io-922d9bf4-254e-41e7-b151-525157cd5bfe for spill files.
> 17:04:12,455 INFO org.apache.flink.runtime.filecache.FileCache
> - User file cache uses directory
> /tmp/flink-dist-cache-792cf7f2-e2be-4950-a39f-d7a21326f054
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Starting TaskManager actor at akka://flink/user/taskmanager#1341641688.
> 17:04:12,617 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager data connection information: cumu02-05.zib.de (dataPort=45731)
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - TaskManager has 16 task slot(s).
> 17:04:12,618 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Memory usage stats: [HEAP: 35502/49216/49216 MB, NON HEAP: 25/52/214 MB
> (used/committed/max)]
> 17:04:12,623 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Trying to register at JobManager
> akka.tcp://flink@130.73.144.59:6123/user/jobmanager (attempt 1, timeout: 500
> milliseconds)
> 17:04:12,773 INFO org.apache.flink.runtime.taskmanager.TaskManager
> - Successful registration at JobManager
> (akka.tcp://flink@130.73.144.59:6123/user/jobmanager), starting network
> stack and library cache.
> ...
>
>
> The same goes for the JM (obviously).
>
> ...
> 17:03:31,632 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Starting JobManger web frontend
> 17:03:31,636 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Setting up web info server, using web-root directory
> jar:file:/nfs/csr/bzcschmi/flink/flink-dist/target/flink-0.10-SNAPSHOT-bin/flink-0.10-SNAPSHOT/lib/flink-dist-0.10-SNAPSHOT.jar!/web-docs-infoserver.
> 17:03:31,753 INFO org.eclipse.jetty.util.log
> - jetty-0.10-SNAPSHOT
> 17:03:31,806 INFO org.eclipse.jetty.util.log
> - Started SelectChannelConnector@0.0.0.0:8081
> 17:03:31,806 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
> - Started web info server for JobManager on 0.0.0.0:8081
>
> // almost 35 seconds //
>
> 17:04:05,091 INFO org.apache.flink.runtime.instance.InstanceManager
> - Registered TaskManager at cumu02-02
> (akka.tcp://flink@130.73.144.61:53549/user/taskmanager) as
> e5ae92397a912c7360524524cf2d172a. Current number of registered hosts is 1.
> Current number of alive task slots is 16.
> ...
>
>
> Is this to be expected? Any ideas what's happening in the meantime? I'm
> asking because I'm running into errors when submitting my job too early (and
> not enough TMs have connected).
>
> Cheers
> Robert
>
> --
> My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680

My GPG Key ID: 336E2680