Hi,
I'm executing a program on a Flink cluster. I tried the same program on a local node in Eclipse and it worked fine. Following the Flink recommendations, on the cluster I set numberOfTaskSlots equal to the number of CPU cores (2), while I set the parallelism to 1. Unfortunately, when I try to execute the job I get:

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #1 (Source: Custom Source -> Timestamps/Watermarks (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < cbc357ccb763df2852fee8c4fc7d55f2 > in sharing group < SlotSharingGroup [e883208d19e3c34f8aaf2a3168a63337, 9dd63673dd41ea021b896d5203f3ba7c, cbc357ccb763df2852fee8c4fc7d55f2] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

As you can see, it says I have 0 available slots... how is this possible? I set no chains or slot sharing groups in the code.
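For reference, the relevant flink-conf.yaml entries look roughly like this on my setup (a sketch based on the description above, not the exact file):

    taskmanager.numberOfTaskSlots: 2   # one slot per CPU core, as recommended
    parallelism.default: 1             # default parallelism for the job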
The error message says that the total number of slots is 0, so it is very likely that no TaskManager is connected to the JobManager. How exactly are you starting the cluster?
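One quick way to check (a sketch; the paths assume the standard standalone layout used in this thread) is to verify that a TaskManager process is running and has actually registered with the JobManager:

    # is a TaskManager JVM running on the node?
    jps | grep TaskManager

    # did it register? (log file name depends on user/host)
    grep "Registered TaskManager" flink-1.3.2/log/flink-*-jobmanager-*.log

The web UI at http://localhost:8081 should also show the number of registered task managers and available slots.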
Update.
The previous error was probably caused by the fact that I didn't restart the cluster before re-executing (maybe). Then I tried to execute the program on a single-node cluster on my laptop and, after solving some small issues, everything worked fine. Now I'm trying to deploy the same jar on the real cluster. Initially everything seems to work correctly:

giordano@giordano-2-2-100-1:~$ ./flink-1.3.2/bin/flink run flink-java-project-0.1.jar
Cluster configuration: Standalone cluster with JobManager at localhost/127.0.0.1:6123
Using address localhost:6123 to connect to JobManager.
JobManager web interface address http://localhost:8081
Starting execution of program
Submitting job with JobID: 161d91dda7c7012c8f48fa8a104a1662. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink@localhost:6123/user/jobmanager#430534598] with leader session id 00000000-0000-0000-0000-000000000000.
09/14/2017 22:05:00 Job execution switched to status RUNNING.
09/14/2017 22:05:00 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to SCHEDULED
09/14/2017 22:05:00 Map -> Sink: Unnamed(1/1) switched to SCHEDULED
09/14/2017 22:05:00 Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink)(1/1) switched to SCHEDULED
09/14/2017 22:05:00 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to DEPLOYING
09/14/2017 22:05:00 Map -> Sink: Unnamed(1/1) switched to DEPLOYING
09/14/2017 22:05:00 Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink)(1/1) switched to DEPLOYING
09/14/2017 22:05:01 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to RUNNING
09/14/2017 22:05:01 Map -> Sink: Unnamed(1/1) switched to RUNNING
09/14/2017 22:05:01 Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink)(1/1) switched to RUNNING

Unfortunately, after about a minute, the job fails:

09/14/2017 22:06:53 Map -> Sink: Unnamed(1/1) switched to FAILED
java.lang.Exception: TaskManager was lost/killed: 413eda6bf77223085c59e104680259bc @ giordano-2-2-100-1 (dataPort=36498)
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

09/14/2017 22:06:53 Job execution switched to status FAILING.
java.lang.Exception: TaskManager was lost/killed: 413eda6bf77223085c59e104680259bc @ giordano-2-2-100-1 (dataPort=36498)
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

09/14/2017 22:06:53 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to CANCELING
09/14/2017 22:06:53 Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink)(1/1) switched to CANCELING
09/14/2017 22:06:53 Source: Custom Source -> Timestamps/Watermarks(1/1) switched to CANCELED
09/14/2017 22:06:53 Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink)(1/1) switched to CANCELED

Then the job is restarted, but it shows the NoResourceAvailableException again.

I start the cluster using the start-cluster.sh script and the task managers on the other nodes start fine as well. On every node I set the number of task slots equal to the number of cores (2), while the parallelism key is commented out. On the master node (which acts as both JobManager and TaskManager) I set

jobmanager.heap.mb: 756
taskmanager.heap.mb: 756

(it has 2 GB of RAM), while on the other two nodes I set

taskmanager.heap.mb: 1512

(they have 2 GB of RAM each).

Hints?
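To make the layout concrete, the configuration described above corresponds roughly to the following (a sketch; the worker hostnames in conf/slaves are placeholders, since I haven't listed the real ones here):

    # flink-conf.yaml on the master (JobManager + TaskManager, 2 GB RAM)
    jobmanager.heap.mb: 756
    taskmanager.heap.mb: 756
    taskmanager.numberOfTaskSlots: 2
    # parallelism.default is left commented out

    # flink-conf.yaml on the two worker nodes (2 GB RAM each)
    taskmanager.heap.mb: 1512
    taskmanager.numberOfTaskSlots: 2

    # conf/slaves on the master lists every host that should run a TaskManager
    giordano-2-2-100-1
    <worker-node-2>
    <worker-node-3>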
P.S.: I tried on my laptop with the same JobManager/TaskManager configuration (RAM, slots, parallelism, etc.) and it works perfectly.
Hi,
Can you check in the TaskManager logs whether there is any message that indicates why the TaskManager was lost? Also, there might be information in your machine logs, e.g. "dmesg" or /var/log/messages or some such.

Best,
Aljoscha
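A common reason for a TaskManager silently disappearing on a small machine is the kernel OOM killer; a quick way to check for that (a sketch, assuming a Linux host) is:

    # look for the kernel killing the TaskManager JVM
    dmesg | grep -i -E "killed process|out of memory"
    grep -i oom /var/log/messages /var/log/syslog 2>/dev/null

If nothing shows up there, the TaskManager's own log under log/flink-*-taskmanager-*.log is the next place to look.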
This is the TaskManager log:
2017-09-15 12:47:49,143 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classe$ 2017-09-15 12:47:49,257 INFO org.apache.flink.runtime.taskmanager.TaskManager - -------------------------------------------------------------------------------- 2017-09-15 12:47:49,257 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 @ 10:23:11 UTC) 2017-09-15 12:47:49,257 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current user: giordano 2017-09-15 12:47:49,257 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11 2017-09-15 12:47:49,258 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 502 MiBytes 2017-09-15 12:47:49,258 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /usr/lib/jvm/java-8-oracle 2017-09-15 12:47:49,261 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.7.2 2017-09-15 12:47:49,261 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options: 2017-09-15 12:47:49,261 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC 2017-09-15 12:47:49,261 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/home/giordano/flink-1.3.2/log/flink-giordano-taskmanager-0-giordano$ 2017-09-15 12:47:49,261 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/home/giordano/flink-1.3.2/conf/log4j.properties 2017-09-15 12:47:49,261 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/home/giordano/flink-1.3.2/conf/logback.xml 2017-09-15 12:47:49,262 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments: 2017-09-15 12:47:49,262 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir 2017-09-15 12:47:49,262 INFO org.apache.flink.runtime.taskmanager.TaskManager - /home/giordano/flink-1.3.2/conf 2017-09-15 12:47:49,262 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /home/giordano/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar:/home/giorda$ 2017-09-15 12:47:49,262 INFO org.apache.flink.runtime.taskmanager.TaskManager - -------------------------------------------------------------------------------- 2017-09-15 12:47:49,263 INFO org.apache.flink.runtime.taskmanager.TaskManager - Registered UNIX signal handlers for [TERM, HUP, INT] 2017-09-15 12:47:49,269 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 65536 2017-09-15 12:47:49,295 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /home/giordano/flink-1.3.2/conf 2017-09-15 12:47:49,298 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /usr/lib/jvm/java-8-oracle 2017-09-15 12:47:49,299 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost 2017-09-15 12:47:49,299 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123 2017-09-15 12:47:49,299 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 512 2017-09-15 12:47:49,299 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 2 2017-09-15 12:47:49,299 INFO 
org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false 2017-09-15 12:47:49,300 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081 2017-09-15 12:47:49,300 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem 2017-09-15 12:47:49,300 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, file:///home/flink-$ 2017-09-15 12:47:49,311 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /usr/lib/jvm/java-8-oracle 2017-09-15 12:47:49,312 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost 2017-09-15 12:47:49,312 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123 2017-09-15 12:47:49,312 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 512 2017-09-15 12:47:49,312 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 2 2017-09-15 12:47:49,312 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false 2017-09-15 12:47:49,313 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081 2017-09-15 12:47:49,313 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem 2017-09-15 12:47:49,313 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, file:///home/flink-$ 2017-09-15 12:47:49,386 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to giordano (auth:SIMPLE) 2017-09-15 12:47:49,463 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the lead$ 2017-09-15 12:47:49,463 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuri$ 2017-09-15 12:47:49,466 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address localhost/127.0.0.1:6123. 2017-09-15 12:47:49,477 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'giordano-2-2-100-1' (192.168.11.56) for comm$ 2017-09-15 12:47:49,478 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager 2017-09-15 12:47:49,479 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at giordano-2-2-100-1:0. 
2017-09-15 12:47:49,989 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2017-09-15 12:47:50,053 INFO Remoting - Starting remoting 2017-09-15 12:47:50,290 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@giordano-2-2-100-1:3512$ 2017-09-15 12:47:50,301 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor 2017-09-15 12:47:50,323 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: giordano-2-2-100-1/192.168.11.56, server port: 0, ssl $ 2017-09-15 12:47:50,331 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms 2017-09-15 12:47:50,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 99 GB, usable 95 GB (95.96% usable) 2017-09-15 12:47:50,534 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per$ 2017-09-15 12:47:50,800 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components. 2017-09-15 12:47:50,816 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). 2017-09-15 12:47:53,827 WARN io.netty.util.internal.ThreadLocalRandom - Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy? 2017-09-15 12:47:53,866 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 3049 ms). Listening on SocketAddress /192.168.11.56$ 2017-09-15 12:47:53,977 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (301 MB), memory wi$ 2017-09-15 12:47:53,986 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-75aed96d-28e9-4bcb-8d9b-4de0e734890d for s$ 2017-09-15 12:47:53,998 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported. 2017-09-15 12:47:54,114 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-5d60853e-9225-4438-9b07-ce6db2$ 2017-09-15 12:47:54,128 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-c00e60e1-3786-45ef-b1bc-570df5$ 2017-09-15 12:47:54,140 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#523808577. 2017-09-15 12:47:54,141 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: cf04d1390ff86aba4d1702ef1a0d2b67 @ giordan$ 2017-09-15 12:47:54,141 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 2 task slot(s). 2017-09-15 12:47:54,143 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 74/197/502 MB, NON HEAP: 33/34/-1 MB (used/committed/max$ 2017-09-15 12:47:54,148 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@localhost:6123/user/jobmanager (a$ 2017-09-15 12:47:54,430 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@localhost:6123/user/jobmana$ 2017-09-15 12:47:54,440 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be localhost/127.0.0.1:39682. Starting BLOB cache. 
2017-09-15 12:47:54,448 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-1c075944-0152-42aa-b64a-607b931$ 2017-09-15 12:48:02,066 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task Source: Custom Source -> Timestamps/Watermarks (1/1) 2017-09-15 12:48:02,081 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Timestamps/Watermarks (1/1) (bc0e95e951deb6680cff372a95495$ 2017-09-15 12:48:02,081 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Source: Custom Source -> Timest$ 2017-09-15 12:48:02,085 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Source: Custom Source -> Timestamps/Watermarks (1/1) (bc$ 2017-09-15 12:48:02,086 INFO org.apache.flink.runtime.blob.BlobCache - Downloading 180bc9dfc19f185ab48acbc5bec0568e15a41665 from localhost/127.0.0.1:39682 2017-09-15 12:48:02,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task Map -> Sink: Unnamed (1/1) 2017-09-15 12:48:02,103 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra $ 2017-09-15 12:48:02,107 INFO org.apache.flink.runtime.taskmanager.Task - Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from CREATED$ 2017-09-15 12:48:02,106 INFO org.apache.flink.runtime.taskmanager.Task - Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) (d$ 2017-09-15 12:48:02,115 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Map -> Sink: Unnamed (1/1) (7ec$ 2017-09-15 12:48:02,116 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b$ 2017-09-15 12:48:02,166 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Learn -> Select -> Process -> ($ 2017-09-15 12:48:02,171 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Learn -> Select -> Process -> (Sink: Cassandra Sink, Sin$ 2017-09-15 12:48:02,988 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Learn -> Select -> Process -> (Sink: Cassandra Sink, S$ 2017-09-15 12:48:02,988 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Source: Custom Source -> Timestamps/Watermarks (1/1) ($ 2017-09-15 12:48:02,991 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce9954$ 2017-09-15 12:48:02,997 INFO org.apache.flink.runtime.taskmanager.Task - Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) (d$ 2017-09-15 12:48:03,005 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Timestamps/Watermarks (1/1) (bc0e95e951deb6680cff372a95495$ 2017-09-15 12:48:03,008 INFO org.apache.flink.runtime.taskmanager.Task - Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from DEPLOYI$ 2017-09-15 12:48:03,078 INFO org.apache.flink.streaming.runtime.tasks.StreamTask - State backend is set to heap memory (checkpoints to filesystem "file:/home/flink-ch$ 2017-09-15 12:48:03,078 INFO org.apache.flink.streaming.runtime.tasks.StreamTask - State backend is set to heap memory (checkpoints to filesystem "file:/home/flink-ch$ 2017-09-15 12:48:03,082 INFO org.apache.flink.streaming.runtime.tasks.StreamTask - State backend is set 
to heap memory (checkpoints to filesystem "file:/home/flink-ch$ 2017-09-15 12:48:03,427 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackend - Initializing heap keyed state backend with stream factory. 2017-09-15 12:48:03,439 WARN org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'latency'. Metric wil$ 2017-09-15 12:48:03,443 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - No restore state for FlinkKafkaConsumer. 2017-09-15 12:48:03,519 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackend - Initializing heap keyed state backend with stream factory. 2017-09-15 12:48:03,545 INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values: metric.reporters = [] metadata.max.age.ms = 300000 partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor] reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 max.partition.fetch.bytes = 1048576 bootstrap.servers = [147.83.29.146:55091] ssl.keystore.type = JKS enable.auto.commit = true sasl.mechanism = GSSAPI interceptor.classes = null exclude.internal.topics = true ssl.truststore.password = null client.id = ssl.endpoint.identification.algorithm = null max.poll.records = 2147483647 check.crcs = true request.timeout.ms = 40000 heartbeat.interval.ms = 3000 auto.commit.interval.ms = 5000 receive.buffer.bytes = 65536 ssl.truststore.type = JKS ssl.truststore.location = null ssl.keystore.password = null fetch.min.bytes = 1 send.buffer.bytes = 131072 value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer group.id = groupId retry.backoff.ms = 100 sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.trustmanager.algorithm = PKIX ssl.key.password = null fetch.max.wait.ms = 500 sasl.kerberos.min.time.before.relogin = 60000 connections.max.idle.ms = 540000 session.timeout.ms = 30000 metrics.num.samples = 2 key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer ssl.protocol = TLS ssl.provider = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.keystore.location = null ssl.cipher.suites = null security.protocol = PLAINTEXT ssl.keymanager.algorithm = SunX509 metrics.sample.window.ms = 30000 auto.offset.reset = latest 2017-09-15 12:48:03,642 INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values: metric.reporters = [] metadata.max.age.ms = 300000 partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor] reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 max.partition.fetch.bytes = 1048576 bootstrap.servers = [147.83.29.146:55091] ssl.keystore.type = JKS enable.auto.commit = true sasl.mechanism = GSSAPI interceptor.classes = null exclude.internal.topics = true ssl.truststore.password = null client.id = consumer-1 ssl.endpoint.identification.algorithm = null max.poll.records = 2147483647 check.crcs = true request.timeout.ms = 40000 heartbeat.interval.ms = 3000 auto.commit.interval.ms = 5000 receive.buffer.bytes = 65536 ssl.truststore.type = JKS ssl.truststore.location = null ssl.keystore.password = null fetch.min.bytes = 1 send.buffer.bytes = 131072 value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer group.id = groupId retry.backoff.ms = 100 sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.trustmanager.algorithm = 
PKIX ssl.key.password = null fetch.max.wait.ms = 500 sasl.kerberos.min.time.before.relogin = 60000 connections.max.idle.ms = 540000 session.timeout.ms = 30000 metrics.num.samples = 2 key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer ssl.protocol = TLS ssl.provider = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.keystore.location = null ssl.cipher.suites = null security.protocol = PLAINTEXT ssl.keymanager.algorithm = SunX509 metrics.sample.window.ms = 30000 auto.offset.reset = latest 2017-09-15 12:48:03,696 INFO com.datastax.driver.core.GuavaCompatibility - Detected Guava < 19 in the classpath, using legacy compatibility layer 2017-09-15 12:48:03,716 INFO org.apache.kafka.common.utils.AppInfoParser - Kafka version : 0.10.0.1 2017-09-15 12:48:03,718 INFO org.apache.kafka.common.utils.AppInfoParser - Kafka commitId : a7a17cdec9eaa6c5 2017-09-15 12:48:03,998 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09 - Got 10 partitions from these topics: [LCacc] 2017-09-15 12:48:03,999 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09 - Consumer is going to read the following topics (with number of partitions): LCa$ 2017-09-15 12:48:04,099 INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values: metric.reporters = [] metadata.max.age.ms = 300000 partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor] reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 max.partition.fetch.bytes = 1048576 bootstrap.servers = [147.83.29.146:55091] ssl.keystore.type = JKS enable.auto.commit = true sasl.mechanism = GSSAPI interceptor.classes = null exclude.internal.topics = true ssl.truststore.password = null client.id = ssl.endpoint.identification.algorithm = null max.poll.records = 2147483647 check.crcs = true request.timeout.ms = 40000 heartbeat.interval.ms = 3000 auto.commit.interval.ms = 5000 receive.buffer.bytes = 65536 ssl.truststore.type = JKS ssl.truststore.location = null ssl.keystore.password = null fetch.min.bytes = 1 send.buffer.bytes = 131072 value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer group.id = groupId retry.backoff.ms = 100 sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.trustmanager.algorithm = PKIX ssl.key.password = null fetch.max.wait.ms = 500 sasl.kerberos.min.time.before.relogin = 60000 connections.max.idle.ms = 540000 session.timeout.ms = 30000 metrics.num.samples = 2 key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer ssl.protocol = TLS ssl.provider = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.keystore.location = null ssl.cipher.suites = null security.protocol = PLAINTEXT ssl.keymanager.algorithm = SunX509 metrics.sample.window.ms = 30000 auto.offset.reset = latest 2017-09-15 12:48:04,111 INFO org.apache.kafka.clients.consumer.ConsumerConfig - ConsumerConfig values: metric.reporters = [] metadata.max.age.ms = 300000 partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor] reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 max.partition.fetch.bytes = 1048576 bootstrap.servers = [147.83.29.146:55091] ssl.keystore.type = JKS enable.auto.commit = true sasl.mechanism = GSSAPI interceptor.classes = null exclude.internal.topics = true ssl.truststore.password = null client.id = consumer-2 ssl.endpoint.identification.algorithm = null 
max.poll.records = 2147483647 check.crcs = true request.timeout.ms = 40000 heartbeat.interval.ms = 3000 auto.commit.interval.ms = 5000 receive.buffer.bytes = 65536 ssl.truststore.type = JKS ssl.truststore.location = null ssl.keystore.password = null fetch.min.bytes = 1 send.buffer.bytes = 131072 value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer group.id = groupId retry.backoff.ms = 100 sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.trustmanager.algorithm = PKIX ssl.key.password = null fetch.max.wait.ms = 500 sasl.kerberos.min.time.before.relogin = 60000 connections.max.idle.ms = 540000 session.timeout.ms = 30000 metrics.num.samples = 2 key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer ssl.protocol = TLS ssl.provider = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.keystore.location = null ssl.cipher.suites = null security.protocol = PLAINTEXT ssl.keymanager.algorithm = SunX509 metrics.sample.window.ms = 30000 auto.offset.reset = latest 2017-09-15 12:48:04,127 INFO org.apache.kafka.common.utils.AppInfoParser - Kafka version : 0.10.0.1 2017-09-15 12:48:04,127 INFO org.apache.kafka.common.utils.AppInfoParser - Kafka commitId : a7a17cdec9eaa6c5 2017-09-15 12:48:04,302 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - Discovered coordinator 147.83.29.146:55091 (id: 2147483647 rack: null) for group$ 2017-09-15 12:48:04,371 INFO com.datastax.driver.core.ClockFactory - Using native clock to generate timestamps. 2017-09-15 12:48:04,445 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-0 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,476 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-1 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,509 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-2 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,540 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-3 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,565 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-4 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,596 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-5 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,619 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-6 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,655 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-7 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,688 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-8 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,720 INFO org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Partition LCacc-9 has no initial offset; the consumer has position 0, so the$ 2017-09-15 12:48:04,896 INFO com.datastax.driver.core.NettyUtil - Found Netty's native epoll transport in the classpath, using it 2017-09-15 12:48:05,557 INFO 
com.datastax.driver.core.policies.DCAwareRoundRobinPolicy - Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorr
2017-09-15 12:48:05,559 INFO com.datastax.driver.core.Cluster - New Cassandra host /147.83.29.146:55092 added
2017-09-15 12:48:05,681 INFO com.datastax.driver.core.ClockFactory - Using native clock to generate timestamps.
2017-09-15 12:48:05,819 INFO com.datastax.driver.core.policies.DCAwareRoundRobinPolicy - Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the$
2017-09-15 12:48:05,819 INFO com.datastax.driver.core.Cluster - New Cassandra host /147.83.29.146:55092 added
2017-09-15 12:48:10,630 INFO org.numenta.nupic.flink.streaming.api.operator.AbstractHTMInferenceOperator - Created HTM network DayDemoNetwork
2017-09-15 12:48:10,675 WARN org.numenta.nupic.network.Layer - The number of Input Dimensions (1) != number of Column Dimensions (1) --OR-- Encoder width (2350) != produ$
2017-09-15 12:48:10,678 INFO org.numenta.nupic.network.Layer - Input dimension fix successful!
2017-09-15 12:48:10,678 INFO org.numenta.nupic.network.Layer - Using calculated input dimensions: [2350]
2017-09-15 12:48:10,679 INFO org.numenta.nupic.network.Layer - Classifying "value" input field with CLAClassifier

There is nothing to show in dmesg. After the execution starts, no errors of that kind appear. Anyway, in the time before the crash the job doesn't really make any progress and the CPU is at about 100% (verified with the top command).
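One way to see what the TaskManager JVM is doing while the CPU is pegged is to grab a few thread dumps with the standard JDK tools (a sketch; the grep pattern is only an assumption about the process name):

    # find the TaskManager JVM and dump its threads while the CPU is at 100%
    TM_PID=$(jps -l | grep -i taskmanager | awk '{print $1}')
    jstack $TM_PID > tm-threads-$(date +%s).txt

Comparing a couple of these dumps shows where the worker threads are spending their time.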
The JobManager log is probably more interesting:
2017-09-15 12:47:45,420 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2017-09-15 12:47:45,650 INFO org.apache.flink.runtime.jobmanager.JobManager - -------------------------------------------------------------------------------- 2017-09-15 12:47:45,650 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 @ 10:23:11 UTC) 2017-09-15 12:47:45,650 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: giordano 2017-09-15 12:47:45,651 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.131-b11 2017-09-15 12:47:45,652 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 491 MiBytes 2017-09-15 12:47:45,652 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: /usr/lib/jvm/java-8-oracle 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.7.2 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options: 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms512m 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx512m 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog.file=/home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-giordano-2-2-100-1.log 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/home/giordano/flink-1.3.2/conf/log4j.properties 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/home/giordano/flink-1.3.2/conf/logback.xml 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments: 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager - /home/giordano/flink-1.3.2/conf 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager - cluster 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager - Classpath: /home/giordano/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar:/home/giordano/flink-1.3.2/lib/flin$ 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager - -------------------------------------------------------------------------------- 2017-09-15 12:47:45,661 INFO org.apache.flink.runtime.jobmanager.JobManager - Registered UNIX signal handlers for [TERM, HUP, INT] 2017-09-15 12:47:45,947 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /home/giordano/flink-1.3.2/conf 2017-09-15 12:47:45,953 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /usr/lib/jvm/java-8-oracle 2017-09-15 12:47:45,953 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost 2017-09-15 12:47:45,953 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123 2017-09-15 12:47:45,954 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 512 2017-09-15 12:47:45,954 INFO 
org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 2 2017-09-15 12:47:45,954 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false 2017-09-15 12:47:45,955 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081 2017-09-15 12:47:45,956 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem 2017-09-15 12:47:45,956 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, file:///home/flink-checkpoints 2017-09-15 12:47:45,970 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager without high-availability 2017-09-15 12:47:45,973 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on localhost:6123 with execution mode CLUSTER 2017-09-15 12:47:45,993 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /usr/lib/jvm/java-8-oracle 2017-09-15 12:47:45,995 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost 2017-09-15 12:47:45,995 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123 2017-09-15 12:47:45,995 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 512 2017-09-15 12:47:45,995 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 2 2017-09-15 12:47:45,996 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false 2017-09-15 12:47:45,996 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081 2017-09-15 12:47:45,996 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem 2017-09-15 12:47:45,996 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, file:///home/flink-checkpoints 2017-09-15 12:47:46,045 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to giordano (auth:SIMPLE) 2017-09-15 12:47:46,209 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at localhost:6123 2017-09-15 12:47:46,878 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2017-09-15 12:47:46,982 INFO Remoting - Starting remoting 2017-09-15 12:47:47,392 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@localhost:6123] 2017-09-15 12:47:47,423 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend 2017-09-15 12:47:47,433 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager log file: /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-gio$ 2017-09-15 12:47:47,434 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager stdout file: /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-$ 2017-09-15 12:47:47,434 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-f9816186-7918-4475-8359-c3cafb63559a for the web interface files 2017-09-15 12:47:47,435 
INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-75aba058-8587-4181-bf16-ee2e68d9b70c for web frontend JAR file uploads 2017-09-15 12:47:47,853 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081 2017-09-15 12:47:47,854 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor 2017-09-15 12:47:47,876 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-9ad4e807-069e-4fb7-88b3-410fbdcb5eb0 2017-09-15 12:47:47,879 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:39682 - max concurrent requests: 50 - max backlog: 1000 2017-09-15 12:47:47,896 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported. 2017-09-15 12:47:47,906 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive 2017-09-15 12:47:47,914 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager. 2017-09-15 12:47:47,929 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@localhost:6123/user/jobmanager on port 8081 2017-09-15 12:47:47,930 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@localhost:6123/user/jobmanager:00000000-0000-0000-0000-0000000$ 2017-09-15 12:47:47,970 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@localhost:6123/user/jobmanag$ 2017-09-15 12:47:48,010 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted leadership with leader session ID S$ 2017-09-15 12:47:48,034 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#$ 2017-09-15 12:47:51,071 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:51,383 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:51,718 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:52,109 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:52,414 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:53,130 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:54,365 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - TaskManager cf04d1390ff86aba4d1702ef1a0d2b67 has started. 
2017-09-15 12:47:54,368 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at giordano-2-2-100-1 (akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager) $ 2017-09-15 12:47:54,434 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:55,150 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:58,455 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:47:59,172 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$ 2017-09-15 12:48:01,650 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job df1f60ca168364759c69dbe078544346 (Flink Streaming Job). 2017-09-15 12:48:01,768 INFO org.apache.flink.runtime.jobmanager.JobManager - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=1, delayBetweenRestartAttempts=0$ 2017-09-15 12:48:01,782 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart 2017-09-15 12:48:01,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Running initialization on master for job Flink Streaming Job (df1f60ca168364759c69dbe078544346). 2017-09-15 12:48:01,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Successfully ran initialization on master in 0 ms. 2017-09-15 12:48:01,830 INFO org.apache.flink.runtime.jobmanager.JobManager - State backend is set to heap memory (checkpoints to filesystem "file:/home/flink-checkpoints") 2017-09-15 12:48:01,843 INFO org.apache.flink.runtime.jobmanager.JobManager - Scheduling job df1f60ca168364759c69dbe078544346 (Flink Streaming Job). 2017-09-15 12:48:01,844 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Streaming Job (df1f60ca168364759c69dbe078544346) switched from state CREATED to RUNNING. 2017-09-15 12:48:01,861 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Timestamps/Watermarks (1/1) (bc0e95e951deb6680cff372a954950d2) switched from CREA$ 2017-09-15 12:48:01,882 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from CREATED to SCHEDULED. 2017-09-15 12:48:01,883 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) (d24f111f9720c8e4df77f67a$ 2017-09-15 12:48:01,895 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Timestamps/Watermarks (1/1) (bc0e95e951deb6680cff372a954950d2) switched from SCHE$ 2017-09-15 12:48:01,912 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Custom Source -> Timestamps/Watermarks (1/1) (attempt #0) to giordano-2-2-100-1 2017-09-15 12:48:01,933 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from SCHEDULED to DEPLOYING. 
2017-09-15 12:48:01,933 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Map -> Sink: Unnamed (1/1) (attempt #0) to giordano-2-2-100-1
2017-09-15 12:48:01,941 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) (d24f111f9720c8e4df77f67a$
2017-09-15 12:48:01,941 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) (attempt #0) to$
2017-09-15 12:48:03,018 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) (d24f111f9720c8e4df77f67a$
2017-09-15 12:48:03,038 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Timestamps/Watermarks (1/1) (bc0e95e951deb6680cff372a954950d2) switched from DEPL$
2017-09-15 12:48:03,046 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from DEPLOYING to RUNNING.
2017-09-15 12:48:06,474 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:48:07,189 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:48:22,495 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
$flink@giordano-2-2-100-1:6123/]] arriving at [akka.tcp://flink@giordano-2-2-100-1:6123] inbound addresses are [akka.tcp://flink@localhost:6123]
2017-09-15 12:48:52,506 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:48:53,212 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:49:22,524 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:49:23,230 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:49:52,544 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@$
2017-09-15 12:49:59,410 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@giordano-2-2-100-1:35127]
2017-09-15 12:49:59,445 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager terminated.
2017-09-15 12:49:59,446 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Timestamps/Watermarks (1/1) (bc0e95e951deb6680cff372a954950d2) switched from RUNN$
java.lang.Exception: TaskManager was lost/killed: cf04d1390ff86aba4d1702ef1a0d2b67 @ giordano-2-2-100-1 (dataPort=36806)

It tags the TaskManager as unreachable, even though it resides on the same node...
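As a side note, the repeated "dropping message ... for non-local recipient" errors say that messages addressed to akka.tcp://flink@giordano-2-2-100-1:6123 arrive at a JobManager whose inbound address is akka.tcp://flink@localhost:6123. If that mismatch turned out to matter, the usual change (a sketch, assuming the master's hostname is giordano-2-2-100-1) would be to bind the JobManager to the real hostname instead of localhost in flink-conf.yaml on all nodes:

    jobmanager.rpc.address: giordano-2-2-100-1

and then restart the cluster with stop-cluster.sh / start-cluster.sh.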
I think it might be that the computation is too CPU heavy, which makes the TaskManager unresponsive to any JobManager messages, and so the JobManager thinks that the TaskManager is lost.
@Till, do you have another idea about what could be going on?
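If that is what is happening, one thing that can be tuned in this Flink version is the Akka death-watch / timeout configuration, so that a busy TaskManager is not declared dead quite so quickly. A sketch of the relevant flink-conf.yaml entries (the values are purely illustrative, not recommendations):

    akka.ask.timeout: 60 s
    akka.lookup.timeout: 30 s
    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 120 s

The akka.watch.* settings control how long a TaskManager may stay silent before the JobManager marks it as unreachable.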
org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.port, 6123 > 2017-09-15 12:47:45,954 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.heap.mb, 512 > 2017-09-15 12:47:45,954 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 2 > 2017-09-15 12:47:45,954 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.memory.preallocate, false > 2017-09-15 12:47:45,955 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.web.port, 8081 > 2017-09-15 12:47:45,956 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend, filesystem > 2017-09-15 12:47:45,956 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend.fs.checkpointdir, > file:///home/flink-checkpoints > 2017-09-15 12:47:45,970 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager without high-availability > 2017-09-15 12:47:45,973 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager on localhost:6123 with execution mode CLUSTER > 2017-09-15 12:47:45,993 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: env.java.home, /usr/lib/jvm/java-8-oracle > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.address, localhost > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.port, 6123 > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.heap.mb, 512 > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 2 > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.memory.preallocate, false > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.web.port, 8081 > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend, filesystem > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend.fs.checkpointdir, > file:///home/flink-checkpoints > 2017-09-15 12:47:46,045 INFO > org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user > set to giordano (auth:SIMPLE) > 2017-09-15 12:47:46,209 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager actor system reachable at localhost:6123 > 2017-09-15 12:47:46,878 INFO akka.event.slf4j.Slf4jLogger > - Slf4jLogger started > 2017-09-15 12:47:46,982 INFO Remoting > - Starting remoting > 2017-09-15 12:47:47,392 INFO Remoting > - Remoting started; listening on addresses > :[akka.tcp://flink@localhost:6123] > 2017-09-15 12:47:47,423 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager web frontend > 2017-09-15 12:47:47,433 INFO > org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined > location of JobManager log file: > 
/home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-gio$ > 2017-09-15 12:47:47,434 INFO > org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined > location of JobManager stdout file: > /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-$ > 2017-09-15 12:47:47,434 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using > directory /tmp/flink-web-f9816186-7918-4475-8359-c3cafb63559a for the web > interface files > 2017-09-15 12:47:47,435 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using > directory /tmp/flink-web-75aba058-8587-4181-bf16-ee2e68d9b70c for web > frontend JAR file uploads > 2017-09-15 12:47:47,853 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend > listening at 0:0:0:0:0:0:0:0:8081 > 2017-09-15 12:47:47,854 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager actor > 2017-09-15 12:47:47,876 INFO org.apache.flink.runtime.blob.BlobServer > - Created BLOB server storage directory > /tmp/blobStore-9ad4e807-069e-4fb7-88b3-410fbdcb5eb0 > 2017-09-15 12:47:47,879 INFO org.apache.flink.runtime.blob.BlobServer > - Started BLOB server at 0.0.0.0:39682 - max concurrent requests: 50 - max > backlog: 1000 > 2017-09-15 12:47:47,896 INFO > org.apache.flink.runtime.metrics.MetricRegistry - No metrics > reporter configured, no metrics will be exposed/reported. > 2017-09-15 12:47:47,906 INFO > org.apache.flink.runtime.jobmanager.MemoryArchivist - Started > memory archivist akka://flink/user/archive > 2017-09-15 12:47:47,914 INFO org.apache.flink.runtime.jobmanager.JobManager > - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager. > 2017-09-15 12:47:47,929 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting > with JobManager akka.tcp://flink@localhost:6123/user/jobmanager on port 8081 > 2017-09-15 12:47:47,930 INFO > org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader > reachable under > akka.tcp://flink@localhost:6123/user/jobmanager:00000000-0000-0000-0000-0000000$ > 2017-09-15 12:47:47,970 INFO > org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager > - Trying to associate with JobManager leader > akka.tcp://flink@localhost:6123/user/jobmanag$ > 2017-09-15 12:47:48,010 INFO org.apache.flink.runtime.jobmanager.JobManager > - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted > leadership with leader session ID S$ > 2017-09-15 12:47:48,034 INFO > org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager > - Resource Manager associating with leading JobManager > Actor[akka://flink/user/jobmanager#$ > 2017-09-15 12:47:51,071 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:51,383 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:51,718 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:52,109 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:52,414 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:53,130 
ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:54,365 INFO > org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager > - TaskManager cf04d1390ff86aba4d1702ef1a0d2b67 has started. > 2017-09-15 12:47:54,368 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at giordano-2-2-100-1 > (akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager) $ > 2017-09-15 12:47:54,434 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:55,150 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:58,455 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:59,172 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:01,650 INFO org.apache.flink.runtime.jobmanager.JobManager > - Submitting job df1f60ca168364759c69dbe078544346 (Flink Streaming Job). > 2017-09-15 12:48:01,768 INFO org.apache.flink.runtime.jobmanager.JobManager > - Using restart strategy > FixedDelayRestartStrategy(maxNumberRestartAttempts=1, > delayBetweenRestartAttempts=0$ > 2017-09-15 12:48:01,782 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers > via failover strategy: full graph restart > 2017-09-15 12:48:01,796 INFO org.apache.flink.runtime.jobmanager.JobManager > - Running initialization on master for job Flink Streaming Job > (df1f60ca168364759c69dbe078544346). > 2017-09-15 12:48:01,796 INFO org.apache.flink.runtime.jobmanager.JobManager > - Successfully ran initialization on master in 0 ms. > 2017-09-15 12:48:01,830 INFO org.apache.flink.runtime.jobmanager.JobManager > - State backend is set to heap memory (checkpoints to filesystem > "file:/home/flink-checkpoints") > 2017-09-15 12:48:01,843 INFO org.apache.flink.runtime.jobmanager.JobManager > - Scheduling job df1f60ca168364759c69dbe078544346 (Flink Streaming Job). > 2017-09-15 12:48:01,844 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink > Streaming Job (df1f60ca168364759c69dbe078544346) switched from state CREATED > to RUNNING. > 2017-09-15 12:48:01,861 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from CREA$ > 2017-09-15 12:48:01,882 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: > Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from CREATED to > SCHEDULED. 
> 2017-09-15 12:48:01,883 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> > Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) > (d24f111f9720c8e4df77f67a$ > 2017-09-15 12:48:01,895 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from SCHE$ > 2017-09-15 12:48:01,912 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying > Source: Custom Source -> Timestamps/Watermarks (1/1) (attempt #0) to > giordano-2-2-100-1 > 2017-09-15 12:48:01,933 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: > Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from SCHEDULED to > DEPLOYING. > 2017-09-15 12:48:01,933 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying > Map -> Sink: Unnamed (1/1) (attempt #0) to giordano-2-2-100-1 > 2017-09-15 12:48:01,941 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> > Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) > (d24f111f9720c8e4df77f67a$ > 2017-09-15 12:48:01,941 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying > Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) > (1/1) (attempt #0) to$ > 2017-09-15 12:48:03,018 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> > Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) > (d24f111f9720c8e4df77f67a$ > 2017-09-15 12:48:03,038 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from DEPL$ > 2017-09-15 12:48:03,046 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: > Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from DEPLOYING to > RUNNING. 
> 2017-09-15 12:48:06,474 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:07,189 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:22,495 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > $flink@giordano-2-2-100-1:6123/]] arriving at > [akka.tcp://flink@giordano-2-2-100-1:6123] inbound addresses are > [akka.tcp://flink@localhost:6123] > 2017-09-15 12:48:52,506 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:53,212 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:22,524 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:23,230 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:52,544 ERROR akka.remote.EndpointWriter > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:59,410 WARN akka.remote.RemoteWatcher > - Detected unreachable: [akka.tcp://flink@giordano-2-2-100-1:35127] > 2017-09-15 12:49:59,445 INFO org.apache.flink.runtime.jobmanager.JobManager > - Task manager akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager > terminated. > 2017-09-15 12:49:59,446 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from RUNN$ > java.lang.Exception: TaskManager was lost/killed: > cf04d1390ff86aba4d1702ef1a0d2b67 @ giordano-2-2-100-1 (dataPort=36806) > > it tag as unreachable the task manager (who reside on the same node...) > > > > -- > Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Moreover, the CPU usage of the other task manager is about 0%.
It seems the problem resides in the job manager, which is not able to distribute the work across the cluster. |
I also tried placing only the job manager on the first node and reconfiguring
the cluster to admit just two task managers. In this way I immediately obtain a NoResourceAvailableException. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Update:
Following other discussions I even tried reducing memory.fraction to 10%, without success. How can I set G1 as the garbage collector? The key is env.java.opts, but what is the value? -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Hi,
You can enable G1 GC with the following setting. In my example it applies only to the taskmanager, but env.java.opts should work in the same way.

env.java.opts.taskmanager: -XX:+UseG1GC |
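For completeness, a minimal flink-conf.yaml sketch of how these options could be laid out; the jobmanager-specific key is assumed to mirror the taskmanager one, and the flag is just the G1 switch already shown above:

# flink-conf.yaml (sketch)
env.java.opts: -XX:+UseG1GC              # applied to all Flink JVM processes
env.java.opts.taskmanager: -XX:+UseG1GC  # taskmanager only
env.java.opts.jobmanager: -XX:+UseG1GC   # assumed analogous key for the jobmanager
|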
Thank you, unfortunately it had no effect.
As I add more load to the computation, the "TaskManager was lost/killed" error appears on the node in use, without other nodes being called in to sustain the computation. I also increased

akka.watch.heartbeat.interval
akka.watch.heartbeat.pause
akka.transport.heartbeat.interval
akka.transport.heartbeat.pause

obtaining just a (very) delayed error. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
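For reference, these settings would go into flink-conf.yaml roughly as in the sketch below; the durations are illustrative placeholders chosen for the example, not values taken from this thread:

# flink-conf.yaml (example values only)
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 100 s
akka.transport.heartbeat.interval: 1000 s
akka.transport.heartbeat.pause: 6000 s
|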
Btw, what load are you putting on the cluster, i.e. what is your computation? If you don't have load, the cluster and job just keep on running, right?
Best, Aljoscha > On 19. Sep 2017, at 12:00, AndreaKinn <[hidden email]> wrote: > > Thank you, unfortunately it had no effects. > > As I add more load on the computation appears the error taskmanager killed > on the node on use, without calling other nodes to sustain the computation. > I also increased > > akka.watch.heartbeat.interval > akka.watch.heartbeat.pause > akka.transport.heartbeat.interval > akka.transport.heartbeat.pause > > obtaining just a (very ) delayed error. > > > > > -- > Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
the program is composed by:
6 Kafka /source/ connector with custom timestamp and watermark /extractor/ and /map/ function each. then I use 6 instance of an external library called flink-htm (quite heavy) moreover I have 6 /process/ method and 2 /union/ method to merge result streams. Finally I have 2 Cassandra /sinks/. Data which arriving to kafka are 1 kb strings about each 20ms. I'm absolutely sure that the flink-htm library is heavy but I hoped flink managed them distributing the load (which are independent) through the cluster... instead it seems like just one node suffers all the load crashing. If can help I can share my code. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
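To make the shape of one such branch concrete, here is a rough Java sketch against the Flink 1.3 DataStream API. The topic name, broker address, timestamp parsing and the map body are placeholders invented for illustration; the heavy flink-htm operators, the process/union steps and the Cassandra sinks are only indicated in comments, since their details are not shown in this thread:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class PipelineSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address

        // One of the six Kafka sources, with a custom timestamp/watermark extractor.
        DataStream<String> source = env
            .addSource(new FlinkKafkaConsumer010<>("sensor-topic", new SimpleStringSchema(), props))
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(String element) {
                        // placeholder: parse the event time out of the ~1 KB record
                        return System.currentTimeMillis();
                    }
                });

        // Placeholder map; in the real job this branch feeds a (heavy) flink-htm
        // operator, whose output is processed, unioned with the other branches
        // and written to the two Cassandra sinks.
        DataStream<String> mapped = source.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.trim();
            }
        });

        mapped.print(); // stands in for the Cassandra sinks in this sketch

        env.execute("Pipeline sketch");
    }
}

Note that with the default slot sharing and a parallelism of 1, Flink may place all of these subtasks into a single slot on one TaskManager, which would match the behaviour described above.
|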