Problem in Flink 1.3.2 with Mesos task managers offers


Francisco Gonzalez
Hello guys,

We have a Flink 1.3.2 session deployed to Mesos from a Marathon JSON definition, with the following parameters (among others) set as environment variables:


"flink_mesos.initial-tasks": "8",
"flink_mesos.resourcemanager.tasks.mem": "4096",

There are other environment variables as well, for ZooKeeper and so on.

The Mesos cluster is shared by different applications (Kafka, ad-hoc jobs...), and its resources are fragmented across the agents. Our problem is that the Flink session accepts every offer, even small ones. When there are not enough offers to satisfy that configuration, it takes all of them anyway, so no resources or offers are left for the other applications.

So the question is: what is the right configuration in this case to keep a single Flink session from taking all the resources?

Thanks in advance.
Regards

This message is private and confidential. If you have received this message in error, please notify the sender or [hidden email] and remove it from your system.

Piksel Inc is a company registered in the United States, 2100 Powers Ferry Road SE, Suite 400, Atlanta, GA 30339


Re: Problem in Flink 1.3.2 with Mesos task managers offers

Eron Wright
Hello, the current behavior is that Flink holds onto received offers for up to two minutes while it attempts to provision the TaskManagers (TMs). Flink can combine small offers to form a single TM, to combat the fragmentation that develops over time in a Mesos cluster. Are you saying that unused offers aren't being released after two minutes?

There's a log entry you should see in the JM log whenever an offer is released:
LOG.info(s"Declined offer ${lease.getId} from ${lease.hostname()} "
  + s"of ${lease.memoryMB()} MB, ${lease.cpuCores()} cpus.")

The timeout value isn't configurable at the moment, but if you're willing to experiment by building Flink from source, you can lower the two-minute timeout as follows. In the `MesosFlinkResourceManager` class, edit the `createOptimizer` method to call `withLeaseOfferExpirySecs` on the `TaskScheduler.Builder` object.
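
To make the effect of the expiry value concrete, here is a rough, self-contained toy model of the hold-then-decline behavior described above. It is not Flink or Fenzo code: the `OfferHoldSketch` class, its `Offer` type, and all method names are hypothetical, and it assumes that offers are only combined when they come from the same agent. Only memory is modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy model of holding, combining, and declining offers. Hypothetical, not Flink/Fenzo code. */
public class OfferHoldSketch {

    /** Simplified resource offer from one agent (hypothetical type). */
    static final class Offer {
        final String id;
        final String host;
        final int memoryMb;
        final long receivedAtSec;

        Offer(String id, String host, int memoryMb, long receivedAtSec) {
            this.id = id;
            this.host = host;
            this.memoryMb = memoryMb;
            this.receivedAtSec = receivedAtSec;
        }
    }

    private final long expirySecs;
    private final List<Offer> held = new ArrayList<>();

    OfferHoldSketch(long expirySecs) {
        this.expirySecs = expirySecs;
    }

    /** Every incoming offer is held, however small it is. */
    void receive(Offer offer) {
        held.add(offer);
    }

    /**
     * Tries to place one TM. Held offers are summed per host (assumption: only
     * same-agent offers can be combined). Returns the chosen host and consumes
     * its offers, or returns null if no host has enough memory yet.
     */
    String tryLaunch(int tmMemoryMb) {
        Map<String, Integer> memoryByHost = new HashMap<>();
        for (Offer o : held) {
            memoryByHost.merge(o.host, o.memoryMb, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : memoryByHost.entrySet()) {
            if (e.getValue() >= tmMemoryMb) {
                String host = e.getKey();
                held.removeIf(o -> o.host.equals(host));
                return host;
            }
        }
        return null;
    }

    /** Declines every offer held for at least expirySecs and returns the declined IDs. */
    List<String> declineExpired(long nowSec) {
        List<String> declined = new ArrayList<>();
        Iterator<Offer> it = held.iterator();
        while (it.hasNext()) {
            Offer o = it.next();
            if (nowSec - o.receivedAtSec >= expirySecs) {
                declined.add(o.id);
                it.remove();
            }
        }
        return declined;
    }

    int heldCount() {
        return held.size();
    }

    public static void main(String[] args) {
        OfferHoldSketch pool = new OfferHoldSketch(120); // two-minute hold
        pool.receive(new Offer("o1", "agent-a", 2048, 0));
        pool.receive(new Offer("o2", "agent-a", 2048, 10));
        pool.receive(new Offer("o3", "agent-b", 1024, 20));
        System.out.println("launched on: " + pool.tryLaunch(4096));
        System.out.println("declined at t=60: " + pool.declineExpired(60));
        System.out.println("declined at t=140: " + pool.declineExpired(140));
    }
}
```

In this model the two 2048 MB offers from the same agent are combined into one 4096 MB TM, while the leftover 1024 MB offer stays held, and so is unavailable to other frameworks, until the expiry elapses. A smaller expiry value shortens that window.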

Let us know if that helps and we'll make the timeout configurable.
-Eron

In reply to the message from Francisco Gonzalez Barea <[hidden email]> of Tue, Sep 19, 2017 at 8:58 AM.

Re: Problem in Flink 1.3.2 with Mesos task managers offers

Francisco Gonzalez
Hello Eron,

Thank you for your reply, we will take a look at this.

Regards


In reply to the message from Eron Wright <[hidden email]> of 19 Sep 2017, 22:37.