Problems with taskmanagers in Mesos Cluster

Manuel Montesino
Hi,

We have deployed a Mesos cluster with Marathon, and we deploy Flink sessions through Marathon with multiple taskmanagers configured. Sometimes, in earlier stages, we change settings in the Marathon JSON (memory and other things). When we redeploy the Flink session, the jobmanagers stop and start with the new configuration, but the taskmanagers do not: they keep running with the configuration they were originally started with. So we have to kill/stop the Docker container of each taskmanager task.

Is there a way to kill or stop the taskmanagers when the session is redeployed?

Some environment configuration from the Marathon JSON file related to taskmanagers:

```
"flink_akka.ask.timeout": "1min",
"flink_akka.framesize": "102400k",
"flink_high-availability": "zookeeper",
"flink_high-availability.zookeeper.path.root": "/flink",
"flink_jobmanager.web.history": "200",
"flink_mesos.failover-timeout": "86400",
"flink_mesos.initial-tasks": "16",
"flink_mesos.maximum-failed-tasks": "-1",
"flink_mesos.resourcemanager.tasks.container.type": "docker",
"flink_mesos.resourcemanager.tasks.mem": "6144",
"flink_metrics.reporters": "jmx",
"flink_metrics.reporter.jmx.class": "org.apache.flink.metrics.jmx.JMXReporter",
"flink_state.backend": "org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory",
"flink_taskmanager.maxRegistrationDuration": "10 min",
"flink_taskmanager.network.numberOfBuffers": "8192",
"flink_jobmanager.heap.mb": "768",
"flink_taskmanager.debug.memory.startLogThread": "true",
"flink_mesos.resourcemanager.tasks.cpus": "1.3",
"flink_env.java.opts.taskmanager": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:ConcGCThreads=1 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+DisableExplicitGC -Djava.awt.headless=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M",
"flink_containerized.heap-cutoff-ratio": "0.67"
```

Thanks in advance and kind regards,

Manuel Montesino
Devops Engineer

E manuel.montesino@piksel(dot)com

Marie Curie,1. Ground Floor. Campanillas, Malaga 29590

liberating viewing | piksel.com

This message is private and confidential. If you have received this message in error, please notify the sender or [hidden email] and remove it from your system.

Piksel Inc is a company registered in the United States, 2100 Powers Ferry Road SE, Suite 400, Atlanta, GA 30339

Re: Problems with taskmanagers in Mesos Cluster

Eron Wright
If I understand you correctly, the high-availability path isn't being changed but other TM-related settings are, and the recovered TMs aren't picking up the new configuration. I don't think that Flink supports on-the-fly reconfiguration of a Task Manager at this time.

As a workaround, to achieve a clean new session when you reconfigure Flink via Marathon, update the HA path accordingly.

Would that work for you?
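One way to picture the workaround, as a minimal sketch rather than anything Flink or Marathon ships: derive a fresh value for `flink_high-availability.zookeeper.path.root` each time the app definition is reconfigured, before submitting it to Marathon. The helper name `with_fresh_ha_root`, the app id, the `deploy-42` suffix, and the assumption that the `flink_*` keys sit under the app's `env` object are all illustrative.

```python
import copy

# Key taken from the configuration posted in this thread.
HA_ROOT_KEY = "flink_high-availability.zookeeper.path.root"


def with_fresh_ha_root(app, suffix):
    """Return a copy of a Marathon app definition whose HA root path ends in `suffix`.

    Assumes (hypothetically) that the flink_* settings live under app["env"].
    The input definition is left unmodified.
    """
    updated = copy.deepcopy(app)
    env = updated.setdefault("env", {})
    base = env.get(HA_ROOT_KEY, "/flink").rstrip("/")
    env[HA_ROOT_KEY] = f"{base}/{suffix}"
    return updated


# Hypothetical base definition; only the keys relevant here are shown.
app = {
    "id": "/flink-session",
    "env": {
        HA_ROOT_KEY: "/flink",
        "flink_mesos.resourcemanager.tasks.mem": "6144",
    },
}

new_app = with_fresh_ha_root(app, "deploy-42")
print(new_app["env"][HA_ROOT_KEY])  # prints /flink/deploy-42
```

Always derive the new path from the base definition, otherwise suffixes pile up across deployments. As far as I can tell, this only gives the new session a clean HA namespace; it does not by itself stop containers that were started under the old path.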



On Wed, Oct 18, 2017 at 6:52 AM, Manuel Montesino <[hidden email]> wrote:
[...]

Re: Problems with taskmanagers in Mesos Cluster

Manuel Montesino

Hi Eron,

Thanks for your response.

Maybe I didn't explain it well. When we redeploy a Flink session, the active taskmanagers are not killed or stopped, and no new ones (with the new configuration) are created or started. A full redeploy is what we want. So there are no recovered TMs: they are still the same ones, with the same configuration.

If we change the ZooKeeper high-availability path, the TMs will be orphaned in Mesos and new ones will be created, and we don't want that.

Another thing is the way we are redeploying. We have developed a script to deploy Flink jobs through Flink's API (we have a pipeline that does all these operations); in this script we stop/kill the session with the /cancel or /cancel-with-savepoint API methods.

Is it clearer now?

Thanks in advance.

From: Eron Wright <[hidden email]>
Sent: Monday, October 23, 2017 19:03:50
To: Manuel Montesino
Cc: [hidden email]; Product-Flow
Subject: Re: Problems with taskmanagers in Mesos Cluster
 
[...]

Re: Problems with taskmanagers in Mesos Cluster

Manuel Montesino

Sorry, forget the comment about the API methods; that applies to Flink jobs.

For the Flink session, we deploy directly to Marathon, and it is Marathon that manages the job. That is why the jobmanager is restarted and the taskmanagers are not: the taskmanagers are created by Flink connecting to Mesos directly, and Marathon doesn't know of any relation between the Marathon job and the Mesos tasks of the Flink taskmanagers.
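Since Marathon has no view of those tasks, one option is to ask the Mesos master directly. The sketch below is only an illustration under assumptions: it filters a master state document (the shape I would expect from the master's `/master/frameworks` endpoint) for running tasks of a framework named `Flink`. The framework name, the master URL, the sample data, and the helper names `running_flink_tasks` and `fetch_frameworks` are hypothetical; check the field names against your Mesos version.

```python
import json
from urllib.request import urlopen


def running_flink_tasks(state, framework_name="Flink"):
    """Return (framework_id, task_id) pairs for running tasks of the named framework.

    `state` is a dict with a "frameworks" list, each entry carrying "id",
    "name" and a "tasks" list (assumed shape, see the note above).
    """
    pairs = []
    for framework in state.get("frameworks", []):
        if framework.get("name") != framework_name:
            continue
        for task in framework.get("tasks", []):
            if task.get("state") == "TASK_RUNNING":
                pairs.append((framework["id"], task["id"]))
    return pairs


def fetch_frameworks(master_url):
    """Fetch the frameworks document from a Mesos master (not called in this sketch)."""
    with urlopen(master_url.rstrip("/") + "/master/frameworks") as response:
        return json.load(response)


# Made-up sample data standing in for a real master response.
sample_state = {
    "frameworks": [
        {
            "id": "fw-1",
            "name": "Flink",
            "tasks": [
                {"id": "taskmanager-00001", "state": "TASK_RUNNING"},
                {"id": "taskmanager-00002", "state": "TASK_FINISHED"},
            ],
        },
        {
            "id": "fw-2",
            "name": "marathon",
            "tasks": [{"id": "flink-session.abc", "state": "TASK_RUNNING"}],
        },
    ]
}

running = running_flink_tasks(sample_state)
print(running)  # prints [('fw-1', 'taskmanager-00001')]
```

A redeploy script could run this lookup right before updating the Marathon app, so that the taskmanager tasks to stop are known without searching for their containers by hand.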


From: Manuel Montesino
Sent: Wednesday, October 25, 2017 11:27:22
To: Eron Wright
Cc: [hidden email]; Product-Flow
Subject: Re: Problems with taskmanagers in Mesos Cluster
 

[...]