Crash in a simple "mapper style" streaming app, likely due to a memory leak?


LINZ, Arnaud

Hello,

 

I use the brand new 0.10 version and I have problems running a streaming execution. My topology is linear: a custom source SC scans a directory and emits HDFS file names; a first mapper M1 opens each file and emits its lines; a filter F filters the lines; another mapper M2 transforms them; and a mapper/sink M3->SK stores them in HDFS.

 

SC->M1->F->M2->M3->SK
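
For reference, here is roughly what this topology looks like in code (Flink 0.10 DataStream API). It is only a sketch: the operator class names below are placeholders, not my actual classes.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Source2BrutJob {
    public static void main(String[] args) throws Exception {
        // Placeholder operator classes: HdfsDirectoryScanSource (SC), LineReader (M1),
        // LineFilter (F), EnrichmentMapper (M2), FormatMapper (M3), HdfsLineSink (SK).
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new HdfsDirectoryScanSource("/PHOCUS2/DEV/UBV/IN")) // SC: emits HDFS file names
           .flatMap(new LineReader())     // M1: opens each file and emits its lines
           .filter(new LineFilter())      // F : keeps only the relevant lines
           .map(new EnrichmentMapper())   // M2: enriches each line using the static table
           .map(new FormatMapper())       // M3: final transformation
           .addSink(new HdfsLineSink());  // SK: writes the result to HDFS

        env.execute("KUBERA-GEO-SOURCE2BRUT");
    }
}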

 

The M2 transformer uses quite a bit of RAM: when it opens, it loads an 11M-row static table into a hash map to enrich the lines. I use 55 slots on YARN: 11 containers of 12 GB with 5 slots each.
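
M2 is essentially a RichMapFunction along these lines (a simplified sketch; the table-loading and key-extraction methods are placeholders for the real code):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Simplified, hypothetical sketch of the M2 enrichment mapper.
public class EnrichmentMapper extends RichMapFunction<String, String> {

    private transient Map<String, String> lookupTable;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Loads the ~11M-row static table once per parallel subtask. Each subtask keeps
        // its own copy, so with 5 slots per container several copies of this map can
        // live inside the same TaskManager JVM.
        lookupTable = loadStaticTable();
    }

    @Override
    public String map(String line) {
        String enrichment = lookupTable.get(extractKey(line));
        return enrichment == null ? line : line + ";" + enrichment;
    }

    private Map<String, String> loadStaticTable() {
        // placeholder: in reality the table is read from HDFS
        return new HashMap<String, String>();
    }

    private String extractKey(String line) {
        // placeholder: in reality a key is extracted from the line
        return line;
    }
}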

 

To my understanding, I should not have any memory problem since each record is independent: no join, no key, no aggregation, no window => it is a simple flow mapper, with RAM simply used as a buffer. However, if I submit enough input data, I systematically crash my app with a "Connection unexpectedly closed by remote task manager" exception, and the first error in the YARN log shows that a "container is running beyond physical memory limits".

 

If I increase the container size, I simply need to feed in more data to make the crash happen.

 

Any idea?

 

Greetings,

Arnaud

 

_________________________________

Exceptions in the Flink dashboard details:

 

Root exception:

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'bt1shli6/172.21.125.31:33186'. This might indicate that the remote task manager was lost.

       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)

       at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

       at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

       at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

       at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

       at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

       at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)

       at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

       at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

       at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)

       at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)

       at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)

       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)

       at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

       at java.lang.Thread.run(Thread.java:744)

 

 

Worker exception:

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager afe405e5c982a0317dd5b98108a33eef @ h1r2dn09 - 5 slots - URL: akka.tcp://flink@172.21.125.31:42012/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:152)

        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

YARN log extract:

 

13:46:18,346 INFO  org.apache.flink.yarn.ApplicationMaster                       - YARN daemon runs as yarn setting user to execute Flink ApplicationMaster/JobManager to datcrypt

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       - --------------------------------------------------------------------------------

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Starting YARN ApplicationMaster/JobManager (Version: 0.10.0, Rev:ab2cca4, Date:10.11.2015 @ 13:50:14 UTC)

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Current user: yarn

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.7/24.45-b08

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Maximum heap size: 1388 MiBytes

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  JAVA_HOME: /usr/java/default

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Hadoop version: 2.6.0

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -  JVM Options:

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Xmx1448M

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Dlog.file=/data/7/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/jobmanager.log

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Dlogback.configurationFile=file:logback.xml

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Dlog4j.configuration=file:log4j.properties

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Program Arguments: (none)

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       - --------------------------------------------------------------------------------

13:46:18,355 INFO  org.apache.flink.yarn.ApplicationMaster                       - registered UNIX signal handlers for [TERM, HUP, INT]

13:46:18,375 INFO  org.apache.flink.yarn.ApplicationMaster                       - Starting ApplicationMaster/JobManager in streaming mode

13:46:18,376 INFO  org.apache.flink.yarn.ApplicationMaster                       - Loading config from: /data/10/hadoop/yarn/local/usercache/datcrypt/appcache/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001.

13:46:18,416 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager

13:46:18,421 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 172.21.125.28:0

13:46:19,022 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started

13:46:19,081 INFO  Remoting                                                      - Starting remoting

13:46:19,248 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@172.21.125.28:50891]

13:46:19,256 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend

13:46:19,283 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-cfd99c17-62ed-4bc8-b2d5-9fde1248ee75 for the web interface files

13:46:19,284 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /data/7/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/jobmanager.log

13:46:19,284 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /data/7/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/jobmanager.out

13:46:19,614 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:55890

13:46:19,616 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor

13:46:19,623 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-1dc260e8-ff31-4b07-9bf6-8630504fd2c6

13:46:19,624 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:35970 - max concurrent requests: 50 - max backlog: 1000

13:46:19,643 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@172.21.125.28:50891/user/jobmanager on port 55890

13:46:19,643 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.21.125.28:50891/user/jobmanager:null.

13:46:19,644 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive

13:46:19,644 INFO  org.apache.flink.yarn.YarnJobManager                          - Starting JobManager at akka.tcp://flink@172.21.125.28:50891/user/jobmanager.

13:46:19,652 INFO  org.apache.flink.yarn.ApplicationMaster                       - Generate configuration file for application master.

13:46:19,656 INFO  org.apache.flink.yarn.YarnJobManager                          - JobManager akka.tcp://flink@172.21.125.28:50891/user/jobmanager was granted leadership with leader session ID None.

13:46:19,668 INFO  org.apache.flink.yarn.ApplicationMaster                       - Starting YARN session on Job Manager.

13:46:19,668 INFO  org.apache.flink.yarn.ApplicationMaster                       - Application Master properly initiated. Awaiting termination of actor system.

13:46:19,680 INFO  org.apache.flink.yarn.YarnJobManager                          - Start yarn session.

13:46:19,762 INFO  org.apache.flink.yarn.YarnJobManager                          - Yarn session with 11 TaskManagers. Tolerating 11 failed TaskManagers

13:46:20,471 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at h1r1nn01.bpa.bouyguestelecom.fr/172.21.125.3:8030

13:46:20,502 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - yarn.client.max-cached-nodemanagers-proxies : 0

13:46:20,503 INFO  org.apache.flink.yarn.YarnJobManager                          - Registering ApplicationMaster with tracking url http://h1r2dn06.bpa.bouyguestelecom.fr:55890.

13:46:20,675 INFO  org.apache.flink.yarn.YarnJobManager                          - Retrieved 0 TaskManagers from previous attempts.

13:46:20,680 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 0.

13:46:20,685 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 1.

13:46:20,685 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 2.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 3.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 4.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 5.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 6.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 7.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 8.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 9.

13:46:20,687 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 10.

13:46:20,737 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:/data/10/hadoop/yarn/local/usercache/datcrypt/appcache/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/flink-conf-modified.yaml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/datcrypt/.flink/application_1444907929501_7416/flink-conf-modified.yaml

13:46:20,983 INFO  org.apache.flink.yarn.YarnJobManager                          - Prepared local resource for modified yaml: resource { scheme: "hdfs" host: "h1r1nn01.bpa.bouyguestelecom.fr" port: 8020 file: "/user/datcrypt/.flink/application_1444907929501_7416/flink-conf-modified.yaml" } size: 5209 timestamp: 1447418780914 type: FILE visibility: APPLICATION

13:46:20,990 INFO  org.apache.flink.yarn.YarnJobManager                          - Create container launch context.

13:46:20,999 INFO  org.apache.flink.yarn.YarnJobManager                          - Starting TM with command=$JAVA_HOME/bin/java -Xms9216m -Xmx9216m -XX:MaxDirectMemorySize=9216m  -Dlog.file="<LOG_DIR>/taskmanager.log" -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err --streamingMode streaming

13:46:21,551 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 0 running. 11 containers missing

13:46:22,085 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn11.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn08.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn06.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn04.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn02.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn03.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn08.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn09.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn12.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn07.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn11.bpa.bouyguestelecom.fr:45454

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000002

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000003

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000004

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000005

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000006

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000007

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000008

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000009

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000010

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000011

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000012

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 0 running. 11 containers missing

13:46:22,104 INFO  org.apache.flink.yarn.YarnJobManager                          - 11 containers already allocated by YARN. Starting...

13:46:22,106 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn11.bpa.bouyguestelecom.fr:45454

13:46:22,151 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000002 on host h1r1dn11.bpa.bouyguestelecom.fr).

13:46:22,152 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn08.bpa.bouyguestelecom.fr:45454

13:46:22,162 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000003 on host h1r2dn08.bpa.bouyguestelecom.fr).

13:46:22,162 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn06.bpa.bouyguestelecom.fr:45454

13:46:22,171 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000004 on host h1r2dn06.bpa.bouyguestelecom.fr).

13:46:22,172 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn04.bpa.bouyguestelecom.fr:45454

13:46:22,182 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000005 on host h1r1dn04.bpa.bouyguestelecom.fr).

13:46:22,183 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn02.bpa.bouyguestelecom.fr:45454

13:46:22,191 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000006 on host h1r1dn02.bpa.bouyguestelecom.fr).

13:46:22,192 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn03.bpa.bouyguestelecom.fr:45454

13:46:22,202 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000007 on host h1r2dn03.bpa.bouyguestelecom.fr).

13:46:22,202 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn08.bpa.bouyguestelecom.fr:45454

13:46:22,257 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000008 on host h1r1dn08.bpa.bouyguestelecom.fr).

13:46:22,258 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn09.bpa.bouyguestelecom.fr:45454

13:46:22,267 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000009 on host h1r2dn09.bpa.bouyguestelecom.fr).

13:46:22,268 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn12.bpa.bouyguestelecom.fr:45454

13:46:22,279 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000010 on host h1r1dn12.bpa.bouyguestelecom.fr).

13:46:22,280 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn07.bpa.bouyguestelecom.fr:45454

13:46:22,290 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000011 on host h1r2dn07.bpa.bouyguestelecom.fr).

13:46:22,290 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn11.bpa.bouyguestelecom.fr:45454

13:46:22,298 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000012 on host h1r2dn11.bpa.bouyguestelecom.fr).

13:46:26,430 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn06 (akka.tcp://flink@172.21.125.28:36435/user/taskmanager) as ec3fba3dc109840893fdefadf9d25769. Current number of registered hosts is 1. Current number of alive task slots is 5.

13:46:26,570 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn08 (akka.tcp://flink@172.21.125.18:49837/user/taskmanager) as d118671119071c18654fd99ce0ef2b26. Current number of registered hosts is 2. Current number of alive task slots is 10.

13:46:26,580 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn11 (akka.tcp://flink@172.21.125.33:45822/user/taskmanager) as f90850c024cdcf76f97d6a6f80aa656b. Current number of registered hosts is 3. Current number of alive task slots is 15.

13:46:26,650 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn08 (akka.tcp://flink@172.21.125.30:52119/user/taskmanager) as 61397a13eccdddb37ff71e28b6655c3c. Current number of registered hosts is 4. Current number of alive task slots is 20.

13:46:26,819 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn12 (akka.tcp://flink@172.21.125.22:42508/user/taskmanager) as 2a1c431c9c5727a727c685eb2bbf2ab0. Current number of registered hosts is 5. Current number of alive task slots is 25.

13:46:26,874 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn07 (akka.tcp://flink@172.21.125.29:43648/user/taskmanager) as 46960d7a66ba385f359a4fdc86fae507. Current number of registered hosts is 6. Current number of alive task slots is 30.

13:46:26,913 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn09 (akka.tcp://flink@172.21.125.31:42048/user/taskmanager) as 44bf60d7f6bc695548b06953ce8c7a01. Current number of registered hosts is 7. Current number of alive task slots is 35.

13:46:27,083 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn02 (akka.tcp://flink@172.21.125.12:46609/user/taskmanager) as d4c153b853ad95e16b3d9f311381d56d. Current number of registered hosts is 8. Current number of alive task slots is 40.

13:46:27,161 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn04 (akka.tcp://flink@172.21.125.14:58610/user/taskmanager) as 051a3c73e96111f0ab04608583c317ff. Current number of registered hosts is 9. Current number of alive task slots is 45.

13:46:27,496 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn11 (akka.tcp://flink@172.21.125.21:36403/user/taskmanager) as 098dee828632d7f3ddc2877e997df866. Current number of registered hosts is 10. Current number of alive task slots is 50.

13:46:29,159 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn03 (akka.tcp://flink@172.21.125.25:58890/user/taskmanager) as bb8607fb775d2219ed384c7f15bda19e. Current number of registered hosts is 11. Current number of alive task slots is 55.

 

13:49:16,869 INFO  org.apache.flink.yarn.YarnJobManager                          - Submitting job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT).

13:49:16,873 INFO  org.apache.flink.yarn.YarnJobManager                          - Scheduling job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT).

13:49:16,873 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from CREATED to SCHEDULED

13:49:16,873 INFO  org.apache.flink.yarn.YarnJobManager                          - Status of job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT) changed to RUNNING.

13:49:16,874 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from SCHEDULED to DEPLOYING

13:49:16,874 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (attempt #0) to h1r2dn09

13:49:16,874 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from CREATED to SCHEDULED

13:49:16,875 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from SCHEDULED to DEPLOYING

13:49:16,875 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (attempt #0) to h1r2dn09

13:49:16,876 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from CREATED to SCHEDULED

13:49:16,876 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from SCHEDULED to DEPLOYING

 

(…)

(The app runs fine up to this point)

(…)

 

14:18:27,692 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@172.21.125.31:42048] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14:18:29,987 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000009 is completed with diagnostics: Container [pid=44905,containerID=container_e05_1444907929501_7416_01_000009] is running beyond physical memory limits. Current usage: 12.1 GB of 12 GB physical memory used; 15.9 GB of 25.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e05_1444907929501_7416_01_000009 :
         |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
         |- 44905 44903 44905 44905 (bash) 0 0 108650496 311 /bin/bash -c /usr/java/default/bin/java -Xms9216m -Xmx9216m -XX:MaxDirectMemorySize=9216m  -Dlog.file=/data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.out 2> /data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.err --streamingMode streaming 
         |- 44914 44905 44905 44905 (java) 345746 7991 17012555776 3165572 /usr/java/default/bin/java -Xms9216m -Xmx9216m -XX:MaxDirectMemorySize=9216m -Dlog.file=/data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . --streamingMode streaming 
 
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

14:18:29,988 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000009 was a running container. Total failed containers 1.

14:18:29,988 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 10 running. 1 containers missing

14:18:29,989 INFO  org.apache.flink.yarn.YarnJobManager                          - There are 1 containers missing. 0 are already requested. Requesting 1 additional container(s) from YARN. Reallocation of failed containers is enabled=true ('yarn.reallocate-failed')

14:18:29,990 INFO  org.apache.flink.yarn.YarnJobManager                          - Requested additional container from YARN. Pending requests 1.

14:18:30,503 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 10 running. 1 containers missing

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn10.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn05.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn02.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn12.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn04.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000013

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000014

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000015

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000016

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000017

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000018

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000019

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000020

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000021

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000022

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 10 running. 1 containers missing

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - 10 containers already allocated by YARN. Starting...

14:18:31,030 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn03.bpa.bouyguestelecom.fr:45454

14:18:31,041 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000013 on host h1r2dn03.bpa.bouyguestelecom.fr).

14:18:31,042 INFO  org.apache.flink.yarn.YarnJobManager                          - Flink has 9 allocated containers which are not needed right now. Returning them

14:18:33,030 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (20/50) (7f278c64beacbad498199a4e7cb1228f) switched from RUNNING to FAILED

14:18:33,032 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from RUNNING to CANCELING

14:18:33,032 INFO  org.apache.flink.yarn.YarnJobManager                          - Status of job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT) changed to FAILING.

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'bt1shli6/172.21.125.31:33186'. This might indicate that the remote task manager was lost.

    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)

    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)

    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)

    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)

    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

    at java.lang.Thread.run(Thread.java:744)

14:18:33,032 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 

14:18:33,032 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (3/50) (c586e6e833e331aaa48a68d57f9252bf) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (4/50) (5ed0f29817b241e9eb04a8876fc76737) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (5/50) (b572b973c66f3052c431fe6c10e5f1ff) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (6/50) (48d3e7ba592b47add633f6e65bf4fba7) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (7/50) (66b0717c028e19a4df23dee6341b56cd) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (8/50) (193fd3a321a5c5711d369691544fa7f7) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (9/50) (91a0e1349ee875d8f4786a204f6096a3) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (10/50) (1ddea0d9b8fb4a58e8f3150756267b9d) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (11/50) (bc26fe9aaf621f9d9c8e5358d2f1208f) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (12/50) (d71455ecb37eeb387a1da56a2f5964de) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (13/50) (a57717ee1b7e9e91091429b4a3c7dbb3) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (14/50) (8b1d11ed1a3cc51ba981b1006b7eb757) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (15/50) (f3e8182fca0651ca2c8a70bd660695c3) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (16/50) (bf5e18f069b74fc284bcf7b324c30277) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (17/50) (fcf8dba87b366a50a9c048dcca6cc18a) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (18/50) (811bf08e6b0c801dee0e4092777d3802) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (19/50) (93fb94616a05b5a0fa0ca7e553778565) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (21/50) (28193eaa9448382b8f462189400519c6) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (22/50) (d6548a95eb0d96c130c42bbe3cfa2c0b) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (23/50) (1e9f2598f8bb97d494bb3fe16a9577b2) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (24/50) (1aca8044e5026f741089945b02987435) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (25/50) (42c877823146567658ad6be8f5825744) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (26/50) (9883561c52a4c7ae4decdcc3ce9a371f) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (27/50) (5875ea808ef3dc5c561066018dc87e28) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (28/50) (fd88b4c83cf67be6a6e6f73b7a5b83b3) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (29/50) (5b8570f5447752c74c904e6ee73c5e20) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (30/50) (a0c2817b7fa7c4c74c74d0273bd6422c) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (31/50) (b9bbfe013b3f4e6cd3203dd8d6440383) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (32/50) (60686a6909ab09b03ea6ee8fd87c8a99) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (33/50) (e1735a0ce66449f357936dcc147053b9) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (34/50) (a368834aec80833291d4a4a5cf974b5d) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (35/50) (d37e0742c5fbf0798c02f1384e8f9bc0) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (36/50) (6154f1df4b63ad96a7d4d03b2a633fb1) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (37/50) (02f33613549b380fd901fcef94136eaa) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (38/50) (34aaaa6576c48b55225e2b446d663621) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (39/50) (b1d4c9aec8114c3889e2c02da0e3d41d) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (40/50) (63833c22e3b6b79d15f12c5b5bb6a70b) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (41/50) (25c16c9065a5486f0e8dcd0cc3f9c6d1) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (42/50) (7b03ade33a4048d3ceed773fda2f9b23) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (43/50) (8424b4ef2960c29fdcae110e565b0316) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (44/50) (921a6a7068b2830c4cb4dd5445cae1f2) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (45/50) (e197594d273125a930335d93cf51254b) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (46/50) (885e3d1c85765a9270fdee1bfd9b9852) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (47/50) (146174eb5be9dbe2296d806783f61868) switched from RUNNING to CANCELING

14:18:33,042 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (48/50) (9e205e3d4bf6c665c3b817c3b622fef4) switched from RUNNING to CANCELING

14:18:33,042 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (49/50) (2650dc529b7149928d41ec35096fd475) switched from RUNNING to CANCELING

14:18:33,042 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (50/50) (93038fac2c92e2b79623579968f9a0da) switched from RUNNING to CANCELING

14:18:33,063 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.21.125.31:42048]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.21.125.31:42048

14:18:33,067 INFO  org.apache.flink.yarn.YarnJobManager                          - Task manager akka.tcp://flink@172.21.125.31:42048/user/taskmanager terminated.

14:18:33,068 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (4/50) (5ed0f29817b241e9eb04a8876fc76737) switched from CANCELING to FAILED

14:18:33,068 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from CANCELING to FAILED

14:18:33,069 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from CANCELING to FAILED

14:18:33,069 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from CANCELING to FAILED

14:18:33,070 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (5/50) (b572b973c66f3052c431fe6c10e5f1ff) switched from CANCELING to FAILED

14:18:33,070 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (3/50) (c586e6e833e331aaa48a68d57f9252bf) switched from CANCELING to FAILED

14:18:33,071 INFO  org.apache.flink.runtime.instance.InstanceManager             - Unregistered task manager akka.tcp://flink@172.21.125.31:42048/user/taskmanager. Number of registered task managers 10. Number of available slots 50.

14:18:33,332 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (8/50) (193fd3a321a5c5711d369691544fa7f7) switched from CANCELING to CANCELED

14:18:33,403 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (11/50) (bc26fe9aaf621f9d9c8e5358d2f1208f) switched from CANCELING to CANCELED

14:18:34,050 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (24/50) (1aca8044e5026f741089945b02987435) switched from CANCELING to CANCELED

14:18:34,522 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (35/50) (d37e0742c5fbf0798c02f1384e8f9bc0) switched from CANCELING to CANCELED

14:18:34,683 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (48/50) (9e205e3d4bf6c665c3b817c3b622fef4) switched from CANCELING to CANCELED

14:18:34,714 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (16/50) (bf5e18f069b74fc284bcf7b324c30277) switched from CANCELING to CANCELED

14:18:34,919 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn03 (akka.tcp://flink@172.21.125.25:44296/user/taskmanager) as d65581cce8a29775ba3df10d02756d11. Current number of registered hosts is 11. Current number of alive task slots is 55.

14:18:35,410 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (31/50) (b9bbfe013b3f4e6cd3203dd8d6440383) switched from CANCELING to CANCELED

14:18:35,513 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (30/50) (a0c2817b7fa7c4c74c74d0273bd6422c) switched from CANCELING to CANCELED

14:18:35,739 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (7/50) (66b0717c028e19a4df23dee6341b56cd) switched from CANCELING to CANCELED

14:18:35,896 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (41/50) (25c16c9065a5486f0e8dcd0cc3f9c6d1) switched from CANCELING to CANCELED

14:18:35,948 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (13/50) (a57717ee1b7e9e91091429b4a3c7dbb3) switched from CANCELING to CANCELED

14:18:36,073 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn09.bpa.bouyguestelecom.fr:45454

14:18:36,073 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn01.bpa.bouyguestelecom.fr:45454

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000023

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000024

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000017 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000014 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000018 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000022 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000019 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000020 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000016 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000015 is completed with diagnostics: Container released by application

14:18:36,075 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000021 is completed with diagnostics: Container released by application

14:18:36,075 INFO  org.apache.flink.yarn.YarnJobManager                          - Flink has 2 allocated containers which are not needed right now. Returning them

14:18:36,197 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (6/50) (48d3e7ba592b47add633f6e65bf4fba7) switched from CANCELING to CANCELED

14:18:36,331 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (38/50) (34aaaa6576c48b55225e2b446d663621) switched from CANCELING to CANCELED

14:18:36,540 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (34/50) (a368834aec80833291d4a4a5cf974b5d) switched from CANCELING to CANCELED

14:18:36,645 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (46/50) (885e3d1c85765a9270fdee1bfd9b9852) switched from CANCELING to CANCELED

14:18:37,084 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (22/50) (d6548a95eb0d96c130c42bbe3cfa2c0b) switched from CANCELING to CANCELED

14:18:37,398 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (39/50) (b1d4c9aec8114c3889e2c02da0e3d41d) switched from CANCELING to CANCELED

14:18:37,834 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (28/50) (fd88b4c83cf67be6a6e6f73b7a5b83b3) switched from CANCELING to CANCELED

14:18:38,433 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (14/50) (8b1d11ed1a3cc51ba981b1006b7eb757) switched from CANCELING to CANCELED

14:18:38,542 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (42/50) (7b03ade33a4048d3ceed773fda2f9b23) switched from CANCELING to CANCELED

14:18:39,011 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (12/50) (d71455ecb37eeb387a1da56a2f5964de) switched from CANCELING to CANCELED

14:18:39,088 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (44/50) (921a6a7068b2830c4cb4dd5445cae1f2) switched from CANCELING to CANCELED

14:18:39,194 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (10/50) (1ddea0d9b8fb4a58e8f3150756267b9d) switched from CANCELING to CANCELED

14:18:39,625 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (21/50) (28193eaa9448382b8f462189400519c6) switched from CANCELING to CANCELED

14:18:40,299 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (29/50) (5b8570f5447752c74c904e6ee73c5e20) switched from CANCELING to CANCELED

14:18:40,572 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (18/50) (811bf08e6b0c801dee0e4092777d3802) switched from CANCELING to CANCELED

14:18:40,985 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (23/50) (1e9f2598f8bb97d494bb3fe16a9577b2) switched from CANCELING to CANCELED

14:18:41,094 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000024 is completed with diagnostics: Container released by application

14:18:41,094 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000023 is completed with diagnostics: Container released by application

14:18:41,162 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (15/50) (f3e8182fca0651ca2c8a70bd660695c3) switched from CANCELING to CANCELED

14:18:41,625 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (40/50) (63833c22e3b6b79d15f12c5b5bb6a70b) switched from CANCELING to CANCELED

14:18:41,710 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (47/50) (146174eb5be9dbe2296d806783f61868) switched from CANCELING to CANCELED

14:18:41,828 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (37/50) (02f33613549b380fd901fcef94136eaa) switched from CANCELING to CANCELED

14:18:41,932 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (27/50) (5875ea808ef3dc5c561066018dc87e28) switched from CANCELING to CANCELED

14:18:41,947 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (19/50) (93fb94616a05b5a0fa0ca7e553778565) switched from CANCELING to CANCELED

14:18:42,806 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (49/50) (2650dc529b7149928d41ec35096fd475) switched from CANCELING to CANCELED

14:18:43,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (50/50) (93038fac2c92e2b79623579968f9a0da) switched from CANCELING to CANCELED

14:18:43,872 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (33/50) (e1735a0ce66449f357936dcc147053b9) switched from CANCELING to CANCELED

14:18:44,474 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (45/50) (e197594d273125a930335d93cf51254b) switched from CANCELING to CANCELED

14:18:44,594 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (17/50) (fcf8dba87b366a50a9c048dcca6cc18a) switched from CANCELING to CANCELED

14:18:44,647 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (36/50) (6154f1df4b63ad96a7d4d03b2a633fb1) switched from CANCELING to CANCELED

14:18:44,926 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (25/50) (42c877823146567658ad6be8f5825744) switched from CANCELING to CANCELED

14:18:46,892 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (32/50) (60686a6909ab09b03ea6ee8fd87c8a99) switched from CANCELING to CANCELED

14:18:47,489 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (9/50) (91a0e1349ee875d8f4786a204f6096a3) switched from CANCELING to CANCELED

14:18:47,547 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (26/50) (9883561c52a4c7ae4decdcc3ce9a371f) switched from CANCELING to CANCELED

14:18:52,491 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (43/50) (8424b4ef2960c29fdcae110e565b0316) switched from CANCELING to CANCELED

14:18:52,493 INFO  org.apache.flink.yarn.YarnJobManager                          - Status of job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT) changed to FAILED.

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'bt1shli6/172.21.125.31:33186'. This might indicate that the remote task manager was lost.

    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)

    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)

    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)

    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)

    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

    at java.lang.Thread.run(Thread.java:744)

14:18:53,017 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:38583] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

14:20:13,057 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.21.125.31:42048]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.21.125.31:42048

14:21:53,076 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.21.125.31:42048]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.21.125.31:42048

Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

rmetzger0
Hi Arnaud,

Can you try running the job again with the configuration value "yarn.heap-cutoff-ratio" set to 0.5?
As you can see, the container was killed because it used more than 12 GB: "12.1 GB of 12 GB physical memory used;".
You can also see from the logs that we limit the JVM heap space to 9.2 GB: "java -Xms9216m -Xmx9216m".

In an ideal world, we would tell the JVM to limit its total memory usage to 12 GB, but sadly the heap is not the only memory the JVM allocates: it also allocates direct memory and other data outside the heap. Therefore, we assign only 75% of the container memory to the heap.
In your case, I assume each JVM holds multiple HDFS clients, a lot of local threads, etc., which is why the memory might not suffice.
With a cutoff ratio of 0.5, we will only use 6 GB for the heap.
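For illustration, the arithmetic behind those numbers (assuming the 12 GB container from the log is 12288 MB) is simply:

    heap = container size * (1 - yarn.heap-cutoff-ratio)
    default cutoff 0.25:   12288 MB * 0.75 = 9216 MB   (matches the "-Xms9216m -Xmx9216m" in the TM launch command)
    cutoff set to 0.5:     12288 MB * 0.50 = 6144 MB   (roughly 6 GB of heap)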

That value might be a bit too high, but I want to make sure that we first identify the issue.
If the job runs with the 50% cutoff, you can then try to reduce it again towards 25% (that is the default value, contrary to what the documentation says).
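For reference, a minimal sketch of how to set the value, either in flink-conf.yaml before starting the YARN session or (assuming the yarn-session.sh dynamic-property option) on the command line:

    # flink-conf.yaml
    yarn.heap-cutoff-ratio: 0.5

    # or, assumed syntax, passed as a dynamic property to the YARN session client
    ./bin/yarn-session.sh -Dyarn.heap-cutoff-ratio=0.5 ...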

I hope that helps.

Regards,
Robert


On Fri, Nov 13, 2015 at 2:58 PM, LINZ, Arnaud <[hidden email]> wrote:

Hello,

 

I use the brand new 0.10 version and I have problems running a streaming execution. My topology is linear : a custom source SC scans a directory and emits hdfs file names ; a first mapper M1 opens the file and emits its lines ; a filter F filters lines ; another mapper M2 transforms them ; and a mapper/sink M3->SK stores them in HDFS.

 

SC->M1->F->M2->M3->SK

 

The M2 transformer uses a bit of RAM because when it opens it loads a 11M row static table inside a hash map to enrich the lines. I use 55 slots on Yarn, using 11 containers of 12Gb x 5 slots

 

To my understanding, I should not have any memory problem since each record is independent : no join, no key, no aggregation, no window => it’s a simple flow mapper, with RAM simply used as a buffer. However, if I submit enough input data, I systematically crash my app with “Connection unexpectedly closed by remote task manager” exception, and the first error in YARN log shows that “a container is running beyond physical memory limits”.

 

If I increase the container size, I simply need to feed in more data to get the crash happen.

 

Any idea?

 

Greetings,

Arnaud

 

_________________________________

Exceptions in Flink dashboard detail :

 

Root Exception :

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'bt1shli6/172.21.125.31:33186'. This might indicate that the remote task manager was lost.

       at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)

       at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

       at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

       at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

       at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

       at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

       at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)

       at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

       at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

       at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)

       at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)

       at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)

       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)

       at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

       at java.lang.Thread.run(Thread.java:744)

 

 

Workers exception :

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager afe405e5c982a0317dd5b98108a33eef @ h1r2dn09 - 5 slots - URL: akka.tcp://flink@172.21.125.31:42012/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:152)

        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Yarn log extract :

 

13:46:18,346 INFO  org.apache.flink.yarn.ApplicationMaster                       - YARN daemon runs as yarn setting user to execute Flink ApplicationMaster/JobManager to datcrypt

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       - --------------------------------------------------------------------------------

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Starting YARN ApplicationMaster/JobManager (Version: 0.10.0, Rev:ab2cca4, Date:10.11.2015 @ 13:50:14 UTC)

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Current user: yarn

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.7/24.45-b08

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Maximum heap size: 1388 MiBytes

13:46:18,351 INFO  org.apache.flink.yarn.ApplicationMaster                       -  JAVA_HOME: /usr/java/default

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Hadoop version: 2.6.0

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -  JVM Options:

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Xmx1448M

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Dlog.file=/data/7/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/jobmanager.log

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Dlogback.configurationFile=file:logback.xml

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -     -Dlog4j.configuration=file:log4j.properties

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       -  Program Arguments: (none)

13:46:18,353 INFO  org.apache.flink.yarn.ApplicationMaster                       - --------------------------------------------------------------------------------

13:46:18,355 INFO  org.apache.flink.yarn.ApplicationMaster                       - registered UNIX signal handlers for [TERM, HUP, INT]

13:46:18,375 INFO  org.apache.flink.yarn.ApplicationMaster                       - Starting ApplicationMaster/JobManager in streaming mode

13:46:18,376 INFO  org.apache.flink.yarn.ApplicationMaster                       - Loading config from: /data/10/hadoop/yarn/local/usercache/datcrypt/appcache/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001.

13:46:18,416 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager

13:46:18,421 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 172.21.125.28:0

13:46:19,022 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started

13:46:19,081 INFO  Remoting                                                      - Starting remoting

13:46:19,248 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@172.21.125.28:50891]

13:46:19,256 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend

13:46:19,283 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-cfd99c17-62ed-4bc8-b2d5-9fde1248ee75 for the web interface files

13:46:19,284 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /data/7/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/jobmanager.log

13:46:19,284 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /data/7/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/jobmanager.out

13:46:19,614 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:55890

13:46:19,616 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor

13:46:19,623 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-1dc260e8-ff31-4b07-9bf6-8630504fd2c6

13:46:19,624 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:35970 - max concurrent requests: 50 - max backlog: 1000

13:46:19,643 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@172.21.125.28:50891/user/jobmanager on port 55890

13:46:19,643 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.21.125.28:50891/user/jobmanager:null.

13:46:19,644 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive

13:46:19,644 INFO  org.apache.flink.yarn.YarnJobManager                          - Starting JobManager at akka.tcp://flink@172.21.125.28:50891/user/jobmanager.

13:46:19,652 INFO  org.apache.flink.yarn.ApplicationMaster                       - Generate configuration file for application master.

13:46:19,656 INFO  org.apache.flink.yarn.YarnJobManager                          - JobManager akka.tcp://flink@172.21.125.28:50891/user/jobmanager was granted leadership with leader session ID None.

13:46:19,668 INFO  org.apache.flink.yarn.ApplicationMaster                       - Starting YARN session on Job Manager.

13:46:19,668 INFO  org.apache.flink.yarn.ApplicationMaster                       - Application Master properly initiated. Awaiting termination of actor system.

13:46:19,680 INFO  org.apache.flink.yarn.YarnJobManager                          - Start yarn session.

13:46:19,762 INFO  org.apache.flink.yarn.YarnJobManager                          - Yarn session with 11 TaskManagers. Tolerating 11 failed TaskManagers

13:46:20,471 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at h1r1nn01.bpa.bouyguestelecom.fr/172.21.125.3:8030

13:46:20,502 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - yarn.client.max-cached-nodemanagers-proxies : 0

13:46:20,503 INFO  org.apache.flink.yarn.YarnJobManager                          - Registering ApplicationMaster with tracking url http://h1r2dn06.bpa.bouyguestelecom.fr:55890.

13:46:20,675 INFO  org.apache.flink.yarn.YarnJobManager                          - Retrieved 0 TaskManagers from previous attempts.

13:46:20,680 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 0.

13:46:20,685 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 1.

13:46:20,685 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 2.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 3.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 4.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 5.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 6.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 7.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 8.

13:46:20,686 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 9.

13:46:20,687 INFO  org.apache.flink.yarn.YarnJobManager                          - Requesting initial TaskManager container 10.

13:46:20,737 INFO  org.apache.flink.yarn.Utils                                   - Copying from file:/data/10/hadoop/yarn/local/usercache/datcrypt/appcache/application_1444907929501_7416/container_e05_1444907929501_7416_01_000001/flink-conf-modified.yaml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/datcrypt/.flink/application_1444907929501_7416/flink-conf-modified.yaml

13:46:20,983 INFO  org.apache.flink.yarn.YarnJobManager                          - Prepared local resource for modified yaml: resource { scheme: "hdfs" host: "h1r1nn01.bpa.bouyguestelecom.fr" port: 8020 file: "/user/datcrypt/.flink/application_1444907929501_7416/flink-conf-modified.yaml" } size: 5209 timestamp: 1447418780914 type: FILE visibility: APPLICATION

13:46:20,990 INFO  org.apache.flink.yarn.YarnJobManager                          - Create container launch context.

13:46:20,999 INFO  org.apache.flink.yarn.YarnJobManager                          - Starting TM with command=$JAVA_HOME/bin/java -Xms9216m -Xmx9216m -XX:MaxDirectMemorySize=9216m  -Dlog.file="<LOG_DIR>/taskmanager.log" -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err --streamingMode streaming

13:46:21,551 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 0 running. 11 containers missing

13:46:22,085 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn11.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn08.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn06.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn04.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn02.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn03.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn08.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn09.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn12.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn07.bpa.bouyguestelecom.fr:45454

13:46:22,086 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn11.bpa.bouyguestelecom.fr:45454

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000002

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000003

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000004

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000005

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000006

13:46:22,102 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000007

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000008

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000009

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000010

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000011

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000012

13:46:22,103 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 0 running. 11 containers missing

13:46:22,104 INFO  org.apache.flink.yarn.YarnJobManager                          - 11 containers already allocated by YARN. Starting...

13:46:22,106 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn11.bpa.bouyguestelecom.fr:45454

13:46:22,151 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000002 on host h1r1dn11.bpa.bouyguestelecom.fr).

13:46:22,152 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn08.bpa.bouyguestelecom.fr:45454

13:46:22,162 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000003 on host h1r2dn08.bpa.bouyguestelecom.fr).

13:46:22,162 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn06.bpa.bouyguestelecom.fr:45454

13:46:22,171 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000004 on host h1r2dn06.bpa.bouyguestelecom.fr).

13:46:22,172 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn04.bpa.bouyguestelecom.fr:45454

13:46:22,182 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000005 on host h1r1dn04.bpa.bouyguestelecom.fr).

13:46:22,183 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn02.bpa.bouyguestelecom.fr:45454

13:46:22,191 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000006 on host h1r1dn02.bpa.bouyguestelecom.fr).

13:46:22,192 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn03.bpa.bouyguestelecom.fr:45454

13:46:22,202 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000007 on host h1r2dn03.bpa.bouyguestelecom.fr).

13:46:22,202 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn08.bpa.bouyguestelecom.fr:45454

13:46:22,257 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000008 on host h1r1dn08.bpa.bouyguestelecom.fr).

13:46:22,258 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn09.bpa.bouyguestelecom.fr:45454

13:46:22,267 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000009 on host h1r2dn09.bpa.bouyguestelecom.fr).

13:46:22,268 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r1dn12.bpa.bouyguestelecom.fr:45454

13:46:22,279 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000010 on host h1r1dn12.bpa.bouyguestelecom.fr).

13:46:22,280 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn07.bpa.bouyguestelecom.fr:45454

13:46:22,290 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000011 on host h1r2dn07.bpa.bouyguestelecom.fr).

13:46:22,290 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn11.bpa.bouyguestelecom.fr:45454

13:46:22,298 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000012 on host h1r2dn11.bpa.bouyguestelecom.fr).

13:46:26,430 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn06 (akka.tcp://flink@172.21.125.28:36435/user/taskmanager) as ec3fba3dc109840893fdefadf9d25769. Current number of registered hosts is 1. Current number of alive task slots is 5.

13:46:26,570 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn08 (akka.tcp://flink@172.21.125.18:49837/user/taskmanager) as d118671119071c18654fd99ce0ef2b26. Current number of registered hosts is 2. Current number of alive task slots is 10.

13:46:26,580 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn11 (akka.tcp://flink@172.21.125.33:45822/user/taskmanager) as f90850c024cdcf76f97d6a6f80aa656b. Current number of registered hosts is 3. Current number of alive task slots is 15.

13:46:26,650 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn08 (akka.tcp://flink@172.21.125.30:52119/user/taskmanager) as 61397a13eccdddb37ff71e28b6655c3c. Current number of registered hosts is 4. Current number of alive task slots is 20.

13:46:26,819 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn12 (akka.tcp://flink@172.21.125.22:42508/user/taskmanager) as 2a1c431c9c5727a727c685eb2bbf2ab0. Current number of registered hosts is 5. Current number of alive task slots is 25.

13:46:26,874 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn07 (akka.tcp://flink@172.21.125.29:43648/user/taskmanager) as 46960d7a66ba385f359a4fdc86fae507. Current number of registered hosts is 6. Current number of alive task slots is 30.

13:46:26,913 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn09 (akka.tcp://flink@172.21.125.31:42048/user/taskmanager) as 44bf60d7f6bc695548b06953ce8c7a01. Current number of registered hosts is 7. Current number of alive task slots is 35.

13:46:27,083 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn02 (akka.tcp://flink@172.21.125.12:46609/user/taskmanager) as d4c153b853ad95e16b3d9f311381d56d. Current number of registered hosts is 8. Current number of alive task slots is 40.

13:46:27,161 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn04 (akka.tcp://flink@172.21.125.14:58610/user/taskmanager) as 051a3c73e96111f0ab04608583c317ff. Current number of registered hosts is 9. Current number of alive task slots is 45.

13:46:27,496 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r1dn11 (akka.tcp://flink@172.21.125.21:36403/user/taskmanager) as 098dee828632d7f3ddc2877e997df866. Current number of registered hosts is 10. Current number of alive task slots is 50.

13:46:29,159 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn03 (akka.tcp://flink@172.21.125.25:58890/user/taskmanager) as bb8607fb775d2219ed384c7f15bda19e. Current number of registered hosts is 11. Current number of alive task slots is 55.

 

13:49:16,869 INFO  org.apache.flink.yarn.YarnJobManager                          - Submitting job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT).

13:49:16,873 INFO  org.apache.flink.yarn.YarnJobManager                          - Scheduling job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT).

13:49:16,873 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from CREATED to SCHEDULED

13:49:16,873 INFO  org.apache.flink.yarn.YarnJobManager                          - Status of job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT) changed to RUNNING.

13:49:16,874 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from SCHEDULED to DEPLOYING

13:49:16,874 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (attempt #0) to h1r2dn09

13:49:16,874 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from CREATED to SCHEDULED

13:49:16,875 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from SCHEDULED to DEPLOYING

13:49:16,875 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (attempt #0) to h1r2dn09

13:49:16,876 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from CREATED to SCHEDULED

13:49:16,876 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from SCHEDULED to DEPLOYING

 

(…)

(The app runs well)

(…)

 

14:18:27,692 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@172.21.125.31:42048] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14:18:29,987 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000009 is completed with diagnostics: Container [pid=44905,containerID=container_e05_1444907929501_7416_01_000009] is running beyond physical memory limits. Current usage: 12.1 GB of 12 GB physical memory used; 15.9 GB of 25.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e05_1444907929501_7416_01_000009 :
         |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
         |- 44905 44903 44905 44905 (bash) 0 0 108650496 311 /bin/bash -c /usr/java/default/bin/java -Xms9216m -Xmx9216m -XX:MaxDirectMemorySize=9216m  -Dlog.file=/data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.out 2> /data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.err --streamingMode streaming 
         |- 44914 44905 44905 44905 (java) 345746 7991 17012555776 3165572 /usr/java/default/bin/java -Xms9216m -Xmx9216m -XX:MaxDirectMemorySize=9216m -Dlog.file=/data/2/hadoop/yarn/log/application_1444907929501_7416/container_e05_1444907929501_7416_01_000009/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . --streamingMode streaming 
 
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

14:18:29,988 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000009 was a running container. Total failed containers 1.

14:18:29,988 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 10 running. 1 containers missing

14:18:29,989 INFO  org.apache.flink.yarn.YarnJobManager                          - There are 1 containers missing. 0 are already requested. Requesting 1 additional container(s) from YARN. Reallocation of failed containers is enabled=true ('yarn.reallocate-failed')

14:18:29,990 INFO  org.apache.flink.yarn.YarnJobManager                          - Requested additional container from YARN. Pending requests 1.

14:18:30,503 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 10 running. 1 containers missing

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn10.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn05.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn02.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn12.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn04.bpa.bouyguestelecom.fr:45454

14:18:31,028 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000013

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000014

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000015

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000016

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000017

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000018

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000019

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000020

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000021

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000022

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - The user requested 11 containers, 10 running. 1 containers missing

14:18:31,029 INFO  org.apache.flink.yarn.YarnJobManager                          - 10 containers already allocated by YARN. Starting...

14:18:31,030 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : h1r2dn03.bpa.bouyguestelecom.fr:45454

14:18:31,041 INFO  org.apache.flink.yarn.YarnJobManager                          - Launching container (container_e05_1444907929501_7416_01_000013 on host h1r2dn03.bpa.bouyguestelecom.fr).

14:18:31,042 INFO  org.apache.flink.yarn.YarnJobManager                          - Flink has 9 allocated containers which are not needed right now. Returning them

14:18:33,030 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (20/50) (7f278c64beacbad498199a4e7cb1228f) switched from RUNNING to FAILED

14:18:33,032 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from RUNNING to CANCELING

14:18:33,032 INFO  org.apache.flink.yarn.YarnJobManager                          - Status of job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT) changed to FAILING.

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'bt1shli6/172.21.125.31:33186'. This might indicate that the remote task manager was lost.

    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)

    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)

    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)

    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)

    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

    at java.lang.Thread.run(Thread.java:744)

14:18:33,032 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 

14:18:33,032 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (3/50) (c586e6e833e331aaa48a68d57f9252bf) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (4/50) (5ed0f29817b241e9eb04a8876fc76737) switched from RUNNING to CANCELING

14:18:33,033 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (5/50) (b572b973c66f3052c431fe6c10e5f1ff) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (6/50) (48d3e7ba592b47add633f6e65bf4fba7) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (7/50) (66b0717c028e19a4df23dee6341b56cd) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (8/50) (193fd3a321a5c5711d369691544fa7f7) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (9/50) (91a0e1349ee875d8f4786a204f6096a3) switched from RUNNING to CANCELING

14:18:33,034 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (10/50) (1ddea0d9b8fb4a58e8f3150756267b9d) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (11/50) (bc26fe9aaf621f9d9c8e5358d2f1208f) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (12/50) (d71455ecb37eeb387a1da56a2f5964de) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (13/50) (a57717ee1b7e9e91091429b4a3c7dbb3) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (14/50) (8b1d11ed1a3cc51ba981b1006b7eb757) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (15/50) (f3e8182fca0651ca2c8a70bd660695c3) switched from RUNNING to CANCELING

14:18:33,035 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (16/50) (bf5e18f069b74fc284bcf7b324c30277) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (17/50) (fcf8dba87b366a50a9c048dcca6cc18a) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (18/50) (811bf08e6b0c801dee0e4092777d3802) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (19/50) (93fb94616a05b5a0fa0ca7e553778565) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (21/50) (28193eaa9448382b8f462189400519c6) switched from RUNNING to CANCELING

14:18:33,036 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (22/50) (d6548a95eb0d96c130c42bbe3cfa2c0b) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (23/50) (1e9f2598f8bb97d494bb3fe16a9577b2) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (24/50) (1aca8044e5026f741089945b02987435) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (25/50) (42c877823146567658ad6be8f5825744) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (26/50) (9883561c52a4c7ae4decdcc3ce9a371f) switched from RUNNING to CANCELING

14:18:33,037 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (27/50) (5875ea808ef3dc5c561066018dc87e28) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (28/50) (fd88b4c83cf67be6a6e6f73b7a5b83b3) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (29/50) (5b8570f5447752c74c904e6ee73c5e20) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (30/50) (a0c2817b7fa7c4c74c74d0273bd6422c) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (31/50) (b9bbfe013b3f4e6cd3203dd8d6440383) switched from RUNNING to CANCELING

14:18:33,038 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (32/50) (60686a6909ab09b03ea6ee8fd87c8a99) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (33/50) (e1735a0ce66449f357936dcc147053b9) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (34/50) (a368834aec80833291d4a4a5cf974b5d) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (35/50) (d37e0742c5fbf0798c02f1384e8f9bc0) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (36/50) (6154f1df4b63ad96a7d4d03b2a633fb1) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (37/50) (02f33613549b380fd901fcef94136eaa) switched from RUNNING to CANCELING

14:18:33,039 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (38/50) (34aaaa6576c48b55225e2b446d663621) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (39/50) (b1d4c9aec8114c3889e2c02da0e3d41d) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (40/50) (63833c22e3b6b79d15f12c5b5bb6a70b) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (41/50) (25c16c9065a5486f0e8dcd0cc3f9c6d1) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (42/50) (7b03ade33a4048d3ceed773fda2f9b23) switched from RUNNING to CANCELING

14:18:33,040 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (43/50) (8424b4ef2960c29fdcae110e565b0316) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (44/50) (921a6a7068b2830c4cb4dd5445cae1f2) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (45/50) (e197594d273125a930335d93cf51254b) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (46/50) (885e3d1c85765a9270fdee1bfd9b9852) switched from RUNNING to CANCELING

14:18:33,041 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (47/50) (146174eb5be9dbe2296d806783f61868) switched from RUNNING to CANCELING

14:18:33,042 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (48/50) (9e205e3d4bf6c665c3b817c3b622fef4) switched from RUNNING to CANCELING

14:18:33,042 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (49/50) (2650dc529b7149928d41ec35096fd475) switched from RUNNING to CANCELING

14:18:33,042 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (50/50) (93038fac2c92e2b79623579968f9a0da) switched from RUNNING to CANCELING

14:18:33,063 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.21.125.31:42048]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.21.125.31:42048

14:18:33,067 INFO  org.apache.flink.yarn.YarnJobManager                          - Task manager akka.tcp://flink@172.21.125.31:42048/user/taskmanager terminated.

14:18:33,068 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (4/50) (5ed0f29817b241e9eb04a8876fc76737) switched from CANCELING to FAILED

14:18:33,068 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (1/50) (764f26e2ae4f5d68e7881928dee8eea8) switched from CANCELING to FAILED

14:18:33,069 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Scan /PHOCUS2/DEV/UBV/IN (1/1) (9aa2539432a19f5b401b92ebcc32ab4e) switched from CANCELING to FAILED

14:18:33,069 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (2/50) (f9ca3935b76ef8376a55c31b57e38388) switched from CANCELING to FAILED

14:18:33,070 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (5/50) (b572b973c66f3052c431fe6c10e5f1ff) switched from CANCELING to FAILED

14:18:33,070 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (3/50) (c586e6e833e331aaa48a68d57f9252bf) switched from CANCELING to FAILED

14:18:33,071 INFO  org.apache.flink.runtime.instance.InstanceManager             - Unregistered task manager akka.tcp://flink@172.21.125.31:42048/user/taskmanager. Number of registered task managers 10. Number of available slots 50.

14:18:33,332 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (8/50) (193fd3a321a5c5711d369691544fa7f7) switched from CANCELING to CANCELED

14:18:33,403 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (11/50) (bc26fe9aaf621f9d9c8e5358d2f1208f) switched from CANCELING to CANCELED

14:18:34,050 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (24/50) (1aca8044e5026f741089945b02987435) switched from CANCELING to CANCELED

14:18:34,522 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (35/50) (d37e0742c5fbf0798c02f1384e8f9bc0) switched from CANCELING to CANCELED

14:18:34,683 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (48/50) (9e205e3d4bf6c665c3b817c3b622fef4) switched from CANCELING to CANCELED

14:18:34,714 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (16/50) (bf5e18f069b74fc284bcf7b324c30277) switched from CANCELING to CANCELED

14:18:34,919 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at h1r2dn03 (akka.tcp://flink@172.21.125.25:44296/user/taskmanager) as d65581cce8a29775ba3df10d02756d11. Current number of registered hosts is 11. Current number of alive task slots is 55.

14:18:35,410 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (31/50) (b9bbfe013b3f4e6cd3203dd8d6440383) switched from CANCELING to CANCELED

14:18:35,513 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (30/50) (a0c2817b7fa7c4c74c74d0273bd6422c) switched from CANCELING to CANCELED

14:18:35,739 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (7/50) (66b0717c028e19a4df23dee6341b56cd) switched from CANCELING to CANCELED

14:18:35,896 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (41/50) (25c16c9065a5486f0e8dcd0cc3f9c6d1) switched from CANCELING to CANCELED

14:18:35,948 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (13/50) (a57717ee1b7e9e91091429b4a3c7dbb3) switched from CANCELING to CANCELED

14:18:36,073 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r1dn09.bpa.bouyguestelecom.fr:45454

14:18:36,073 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl         - Received new token for : h1r2dn01.bpa.bouyguestelecom.fr:45454

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000023

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Got new container for allocation: container_e05_1444907929501_7416_01_000024

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000017 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000014 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000018 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000022 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000019 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000020 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000016 is completed with diagnostics: Container released by application

14:18:36,074 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000015 is completed with diagnostics: Container released by application

14:18:36,075 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000021 is completed with diagnostics: Container released by application

14:18:36,075 INFO  org.apache.flink.yarn.YarnJobManager                          - Flink has 2 allocated containers which are not needed right now. Returning them

14:18:36,197 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (6/50) (48d3e7ba592b47add633f6e65bf4fba7) switched from CANCELING to CANCELED

14:18:36,331 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (38/50) (34aaaa6576c48b55225e2b446d663621) switched from CANCELING to CANCELED

14:18:36,540 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (34/50) (a368834aec80833291d4a4a5cf974b5d) switched from CANCELING to CANCELED

14:18:36,645 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (46/50) (885e3d1c85765a9270fdee1bfd9b9852) switched from CANCELING to CANCELED

14:18:37,084 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (22/50) (d6548a95eb0d96c130c42bbe3cfa2c0b) switched from CANCELING to CANCELED

14:18:37,398 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (39/50) (b1d4c9aec8114c3889e2c02da0e3d41d) switched from CANCELING to CANCELED

14:18:37,834 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (28/50) (fd88b4c83cf67be6a6e6f73b7a5b83b3) switched from CANCELING to CANCELED

14:18:38,433 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (14/50) (8b1d11ed1a3cc51ba981b1006b7eb757) switched from CANCELING to CANCELED

14:18:38,542 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (42/50) (7b03ade33a4048d3ceed773fda2f9b23) switched from CANCELING to CANCELED

14:18:39,011 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (12/50) (d71455ecb37eeb387a1da56a2f5964de) switched from CANCELING to CANCELED

14:18:39,088 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (44/50) (921a6a7068b2830c4cb4dd5445cae1f2) switched from CANCELING to CANCELED

14:18:39,194 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (10/50) (1ddea0d9b8fb4a58e8f3150756267b9d) switched from CANCELING to CANCELED

14:18:39,625 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (21/50) (28193eaa9448382b8f462189400519c6) switched from CANCELING to CANCELED

14:18:40,299 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (29/50) (5b8570f5447752c74c904e6ee73c5e20) switched from CANCELING to CANCELED

14:18:40,572 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (18/50) (811bf08e6b0c801dee0e4092777d3802) switched from CANCELING to CANCELED

14:18:40,985 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (23/50) (1e9f2598f8bb97d494bb3fe16a9577b2) switched from CANCELING to CANCELED

14:18:41,094 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000024 is completed with diagnostics: Container released by application

14:18:41,094 INFO  org.apache.flink.yarn.YarnJobManager                          - Container container_e05_1444907929501_7416_01_000023 is completed with diagnostics: Container released by application

14:18:41,162 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (15/50) (f3e8182fca0651ca2c8a70bd660695c3) switched from CANCELING to CANCELED

14:18:41,625 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (40/50) (63833c22e3b6b79d15f12c5b5bb6a70b) switched from CANCELING to CANCELED

14:18:41,710 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (47/50) (146174eb5be9dbe2296d806783f61868) switched from CANCELING to CANCELED

14:18:41,828 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (37/50) (02f33613549b380fd901fcef94136eaa) switched from CANCELING to CANCELED

14:18:41,932 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (27/50) (5875ea808ef3dc5c561066018dc87e28) switched from CANCELING to CANCELED

14:18:41,947 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (19/50) (93fb94616a05b5a0fa0ca7e553778565) switched from CANCELING to CANCELED

14:18:42,806 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (49/50) (2650dc529b7149928d41ec35096fd475) switched from CANCELING to CANCELED

14:18:43,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (50/50) (93038fac2c92e2b79623579968f9a0da) switched from CANCELING to CANCELED

14:18:43,872 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (33/50) (e1735a0ce66449f357936dcc147053b9) switched from CANCELING to CANCELED

14:18:44,474 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (45/50) (e197594d273125a930335d93cf51254b) switched from CANCELING to CANCELED

14:18:44,594 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (17/50) (fcf8dba87b366a50a9c048dcca6cc18a) switched from CANCELING to CANCELED

14:18:44,647 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (36/50) (6154f1df4b63ad96a7d4d03b2a633fb1) switched from CANCELING to CANCELED

14:18:44,926 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (25/50) (42c877823146567658ad6be8f5825744) switched from CANCELING to CANCELED

14:18:46,892 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (32/50) (60686a6909ab09b03ea6ee8fd87c8a99) switched from CANCELING to CANCELED

14:18:47,489 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (9/50) (91a0e1349ee875d8f4786a204f6096a3) switched from CANCELING to CANCELED

14:18:47,547 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (26/50) (9883561c52a4c7ae4decdcc3ce9a371f) switched from CANCELING to CANCELED

14:18:52,491 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> Filter -> Flat Map -> Map -> Sink: Unnamed (43/50) (8424b4ef2960c29fdcae110e565b0316) switched from CANCELING to CANCELED

14:18:52,493 INFO  org.apache.flink.yarn.YarnJobManager                          - Status of job 0348520c75b5c18c9833d8eb4732c009 (KUBERA-GEO-SOURCE2BRUT) changed to FAILED.

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'bt1shli6/172.21.125.31:33186'. This might indicate that the remote task manager was lost.

    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)

    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)

    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)

    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)

    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)

    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

    at java.lang.Thread.run(Thread.java:744)

14:18:53,017 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:38583] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

14:20:13,057 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.21.125.31:42048]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.21.125.31:42048

14:21:53,076 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.21.125.31:42048]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.21.125.31:42048

 

 






RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

LINZ, Arnaud

Hi Robert,

 

Thanks, it works with 50% -- at least it gets well past the previous crash point.

 

In my opinion (I lack real metrics), the part that uses the most memory is the M2 mapper, which is instantiated once per slot.

The most complex part is the sink (it uses a lot of HDFS files, flushing threads, etc.), but I expect the "RichSinkFunction" to be instantiated only once per slot? I'm really surprised by that memory usage; I will try using a monitoring tool on the YARN JVM to understand.
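
(For illustration, a minimal sketch of what such an enrichment mapper typically looks like -- open() runs once per parallel subtask, so each slot holds its own copy of the lookup table. Class and method names here are made up; this is not the actual job code.)

    // Sketch only: a rich mapper that loads its lookup table in open(), once per parallel subtask.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class EnrichMapper extends RichMapFunction<String, String> {
        private transient Map<String, String> referenceTable;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Executed once per subtask; with 5 slots per TaskManager, up to 5 copies live in one JVM.
            referenceTable = new HashMap<>();
            // referenceTable.putAll(loadReferenceTable());  // hypothetical loading of the static table
        }

        @Override
        public String map(String line) {
            String extra = referenceTable.get(line);
            return extra == null ? line : line + ";" + extra;
        }
    }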

 

How do I set this yarn.heap-cutoff-ratio parameter for a specific application? I don't want to modify the "root-protected" flink-conf.yaml with that value for all users and Flink jobs.

 

Regards,

Arnaud

 

From: Robert Metzger [mailto:[hidden email]]
Sent: Friday, 13 November 2015 15:16
To: [hidden email]
Subject: Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

 

Hi Arnaud,

 

Can you try running the job again with the configuration value "yarn.heap-cutoff-ratio" set to 0.5?

As you can see, the container was killed because it used more than 12 GB: "12.1 GB of 12 GB physical memory used".
You can also see from the logs that we limit the JVM heap space to 9.2 GB: "java -Xms9216m -Xmx9216m".

 

In an ideal world, we would tell the JVM to limit its memory usage to 12 GB, but sadly the heap is not the only memory the JVM allocates. It also allocates direct memory and other things outside the heap. Therefore, we give only 75% of the container memory to the heap.

In your case, I assume that each JVM has multiple HDFS clients, a lot of local threads, etc.; that's why the memory might not suffice.

With a cutoff ratio of 0.5, we'll only use 6 GB for the heap.
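
(A quick worked example of the sizing above, assuming the rule is simply heap = container size x (1 - cutoff ratio), which matches the figures quoted:)

    // Assumed sizing rule: heap = container * (1 - cutoff). Matches the numbers quoted above.
    public class HeapCutoffExample {
        public static void main(String[] args) {
            long containerMb = 12 * 1024;                              // 12 GB YARN container
            long defaultHeapMb = (long) (containerMb * (1 - 0.25));    // 9216 MB -> "-Xms9216m -Xmx9216m"
            long halfCutoffHeapMb = (long) (containerMb * (1 - 0.5));  // 6144 MB -> about 6 GB of heap
            System.out.println(defaultHeapMb + " MB / " + halfCutoffHeapMb + " MB");
        }
    }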

 

That value might be a bit too high, but I want to make sure that we first identify the issue.

If the job runs with the 50% cutoff, you can try to reduce it again towards 25% (that's the default value, contrary to what the documentation says).

 

I hope that helps.

 

Regards,

Robert

 

 


Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

rmetzger0
Hi Arnaud,

Your M2 mapper allocates memory on the JVM heap. That should not cause any issues, because the heap is limited to 9.2 GB anyway. The problem is off-heap allocations.
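
(For readers less familiar with the distinction: direct buffers live outside the JVM heap, so -Xmx does not cover them, but they still count towards the YARN container's physical memory limit. A trivial illustration:)

    // Illustration only: this buffer is allocated off-heap. It does not show up in the -Xmx heap,
    // but it does add to the process footprint that YARN compares against the container limit.
    import java.nio.ByteBuffer;

    public class DirectBufferExample {
        public static void main(String[] args) {
            ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB of direct memory
            System.out.println("Allocated " + offHeap.capacity() + " bytes off-heap");
        }
    }
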
The RichSinkFunction is only instantiated once per slot, yes.

You can set application-specific parameters using "-Dyarn.heap-cutoff-ratio=0.5".

Regards,
Robert







Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

Ufuk Celebi
In reply to this post by LINZ, Arnaud

> On 13 Nov 2015, at 15:49, LINZ, Arnaud <[hidden email]> wrote:
>
> Hi Robert,
>  
> Thanks, it works with 50% -- at least way past the previous crash point.
>  
> In my opinion (I lack real metrics), the part that uses the most memory is the M2 mapper, instantiated once per slot.
> The most complex part is the Sink (it does use a lot of hdfs files, flushing threads etc.) ; but I expect the “RichSinkFunction” to be instantiated only once per slot ? I’m really surprised by that memory usage, I will try using a monitoring app on the yarn jvm to understand.

In general it’s instantiated once per subtask. For your current deployment, it is one per slot.

– Ufuk


Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

Stephan Ewen
Hi Arnaud!

Java direct memory is tricky to debug. You can turn on the memory logging or check the TaskManager tab in the web dashboard - both report on direct memory consumption.

One thing to look for is forgetting to close streams. Unclosed streams keep consuming native resources until the Java object is garbage collected, which may happen quite a bit later.
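
(A minimal sketch of the pattern Stephan describes, assuming a Hadoop FileSystem obtained from the configuration; the names are illustrative and this is not the job's actual sink code.)

    // Sketch only: close streams deterministically (e.g. with try-with-resources)
    // instead of relying on garbage collection to release their native resources.
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CloseStreamsExample {
        public static void write(byte[] data, String pathStr) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (OutputStream out = fs.create(new Path(pathStr))) {
                out.write(data);  // the stream is closed as soon as this block exits
            }
        }
    }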

Greetings,
Stephan





RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

LINZ, Arnaud
In reply to this post by rmetzger0

Hello,

 

I did have an off-heap memory leak in my streaming application, due to https://issues.apache.org/jira/browse/HADOOP-12007.

 

Now that I use the CodecPool to plug that leak, I get the following error under load:

 

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: java.lang.OutOfMemoryError: Direct buffer memory

    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:153)

    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)

    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)

    at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)

    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)

    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)

    at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)

    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)

    at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:737)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:310)

    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)

    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)

    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)

    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)

    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)

    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)

    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)

    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

    at java.lang.Thread.run(Thread.java:744)

Caused by: io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory

    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:234)

    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)

    ... 9 more

Caused by: java.lang.OutOfMemoryError: Direct buffer memory

    at java.nio.Bits.reserveMemory(Bits.java:658)

    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)

    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)

    at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:651)

    at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)

    at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)

    at io.netty.buffer.PoolArena.reallocate(PoolArena.java:358)

    at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:111)

    at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)

    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)

    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)

    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)

    at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92)

    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:228)

    ... 10 more

 

 

However, the JVM heap is fine (monitored with JVisualVM), and the memory size of the JVM process is half of what it was when YARN killed the container because of the memory leak.

 

Note that I have added a "PartitionBy" to my stream, before the sink, so my app is no longer a simple "mapper style" app.

 

Do you know the cause of the error and how to correct it?
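
(For reference, a sketch of the CodecPool pattern mentioned above -- the codec and the raw input stream are assumed to come from elsewhere; this is not the application's actual code.)

    // Sketch only: borrow a decompressor from the pool and always return it,
    // so native zlib buffers are reused instead of leaking off-heap.
    import java.io.InputStream;
    import org.apache.hadoop.io.compress.CodecPool;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.Decompressor;

    public class CodecPoolExample {
        public static long countDecompressedBytes(CompressionCodec codec, InputStream raw) throws Exception {
            Decompressor decompressor = CodecPool.getDecompressor(codec);
            try (InputStream in = codec.createInputStream(raw, decompressor)) {
                long total = 0;
                byte[] buf = new byte[8192];
                int read;
                while ((read = in.read(buf)) != -1) {
                    total += read;
                }
                return total;
            } finally {
                CodecPool.returnDecompressor(decompressor);  // give the native decompressor back to the pool
            }
        }
    }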

 

Best regards,

 

Arnaud

 

 

 


Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

Stephan Ewen
Hi!

That is curious. Can you tell us a bit more about your setup?

  - Did you set Flink to use off-heap memory in the config?
  - What parallelism do you run the job with?
  - What Java and Flink versions are you using?

Even better, can you paste the first part of the TaskManager's log (where it prints the environment) here?

Thanks,
Stephan




RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

LINZ, Arnaud

Hi,

I’ve just run into another exception, a java.lang.IndexOutOfBoundsException in the zlib library this time.

Therefore I suspect a problem in the way Hadoop’s codec pool is used. I’m investigating and will keep you informed.

 

Thanks,

Arnaud

 

 
