Hello all,
I have an algorithm x() that contains several joins and uses Gelly ConnectedComponents three times. The problem is that if I call x() more than three times inside a script, I get the messages listed below in the log and the program somehow stops. It happens even when I run it on a toy example of a graph with fewer than 10 vertices. Do you have any clue what the problem is?

Cheers,
Alieh

129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
129150 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from e80ec35f3d0a04a68000ecbdc555f98b.
129150 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received new slot report from TaskManager 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle timeout.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.
|
Hi,
Please investigate the logs and the standard output/error of the task manager that has failed (the logs you showed are from the job manager). There is probably an obvious error or exception explaining why it failed. The most common reasons:
- out of memory
- long GC pause
- segfault or another error from a native library
- task manager killed, for example via SIGKILL

Piotrek

> On 6 Dec 2018, at 17:34, Alieh <[hidden email]> wrote:
>
> Hello all,
>
> I have an algorithm x() that contains several joins and uses Gelly ConnectedComponents three times. The problem is that if I call x() more than three times inside a script, I get the messages listed below in the log and the program somehow stops. It happens even when I run it on a toy example of a graph with fewer than 10 vertices. Do you have any clue what the problem is?
>
> Cheers,
> Alieh
|
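The common failure reasons listed in the reply above can be looked for in a task manager log with a small script. This is only a minimal sketch: the pattern table is an assumption made for illustration (generic JVM strings such as `java.lang.OutOfMemoryError`, not an official list of Flink messages), and a kill via SIGKILL typically leaves no trace in the process's own log, so it is not covered here.

```python
import re
from collections import defaultdict

# Illustrative patterns only (assumptions, not an official list of Flink messages).
FAILURE_PATTERNS = {
    "out of memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "native crash": re.compile(r"SIGSEGV|A fatal error has been detected"),
    "error or exception": re.compile(r"\bERROR\b|Exception"),
}


def scan_log(lines, patterns=FAILURE_PATTERNS):
    """Return {label: [(line_number, text), ...]} for every matching line."""
    hits = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for label, pattern in patterns.items():
            if pattern.search(line):
                hits[label].append((number, line.rstrip("\n")))
    return dict(hits)
```

A possible use is `scan_log(open(path, errors="replace"))` on each file in the Flink `log` directory, including the `.out` files that hold standard output/error. An empty result only means none of these particular patterns matched.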
Hello Piotrek,

thank you for your answer. I installed Flink on a local cluster and used the GUI to monitor the task managers. It seems the program does not start at all. The whole time, only the job manager is struggling... For very small toy examples, after a long time (during which I see the job manager logs I mentioned before), the job starts and then executes in 2 seconds.

Best,
Alieh

On 12/07/2018 10:43 AM, Piotr Nowojski wrote:
> Hi,
>
> Please investigate the logs and the standard output/error of the task manager that has failed (the logs you showed are from the job manager). There is probably an obvious error or exception explaining why it failed. The most common reasons:
> - out of memory
> - long GC pause
> - segfault or another error from a native library
> - task manager killed, for example via SIGKILL
>
> Piotrek
|
Hi,

Have you checked the task manager logs?

Piotrek
|
Hello,

this is the task manager log, but it does not change after I run the program. I think the Flink planner has a problem with my program. It cannot even start the job.

Best,
Alieh

2018-12-10 12:20:20,386 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2018-12-10 12:20:20,387 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
2018-12-10 12:20:20,387 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: alieh
2018-12-10 12:20:20,609 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-10 12:20:20,768 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: alieh
2018-12-10 12:20:20,769 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b12
2018-12-10 12:20:20,769 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes
2018-12-10 12:20:20,769 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: /usr/lib/jvm/java-8-oracle
2018-12-10 12:20:20,774 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Hadoop version: 2.4.1
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog.file=/home/alieh/flink-1.6.0/log/flink-alieh-taskexecutor-0-alieh-P67A-D3-B3.log
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/home/alieh/flink-1.6.0/conf/log4j.properties
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/home/alieh/flink-1.6.0/conf/logback.xml
2018-12-10 12:20:20,775 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /home/alieh/flink-1.6.0/conf
2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /home/alieh/flink-1.6.0/lib/flink-python_2.11-1.6.0.jar:/home/alieh/flink-1.6.0/lib/flink-shaded-hadoop2-uber-1.6.0.jar:/home/alieh/flink-1.6.0/lib/log4j-1.2.17.jar:/home/alieh/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar:/home/alieh/flink-1.6.0/lib/flink-dist_2.11-1.6.0.jar:::
2018-12-10 12:20:20,776 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2018-12-10 12:20:20,777 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-12-10 12:20:20,785 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum number of open file descriptors is 1048576.
2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-12-10 12:20:20,803 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-12-10 12:20:20,804 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2018-12-10 12:20:20,912 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to alieh (auth:SIMPLE)
2018-12-10 12:20:21,131 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-12-10 12:20:21,135 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-12-10 12:20:21,136 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-12-10 12:20:21,145 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address localhost/127.0.0.1:6123.
2018-12-10 12:20:21,204 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager will use hostname/address 'alieh-P67A-D3-B3' (127.0.1.1) for communication.
2018-12-10 12:20:21,208 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Starting AkkaRpcService at alieh-p67a-d3-b3:0.
2018-12-10 12:20:21,805 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-12-10 12:20:21,898 INFO akka.remote.Remoting - Starting remoting
2018-12-10 12:20:22,091 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@alieh-p67a-d3-b3:44267]
2018-12-10 12:20:22,117 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-12-10 12:20:22,124 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /tmp/blobStore-32ec7a05-737e-4b46-b716-3a0831683c47
2018-12-10 12:20:22,127 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-4b33c843-b7d3-45dc-814f-850e8c6be21a
2018-12-10 12:20:22,136 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: alieh-P67A-D3-B3/127.0.1.1, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2018-12-10 12:20:22,166 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 450 GB, usable 91 GB (20.22% usable)
2018-12-10 12:20:22,211 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
2018-12-10 12:20:22,256 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2018-12-10 12:20:22,256 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2018-12-10 12:20:22,257 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2018-12-10 12:20:22,289 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 31 ms).
2018-12-10 12:20:22,325 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 35 ms). Listening on SocketAddress /127.0.1.1:46127.
2018-12-10 12:20:22,326 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2018-12-10 12:20:22,329 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-4f10dc60-3805-4c50-85a1-497c99dfb20c for spill files.
2018-12-10 12:20:22,387 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2018-12-10 12:20:22,394 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2018-12-10 12:20:22,406 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
2018-12-10 12:20:22,407 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@localhost:6123/user/resourcemanager(00000000000000000000000000000000).
2018-12-10 12:20:22,409 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-058052c5-36cc-432f-88eb-8acf7dc5f1f1
2018-12-10 12:20:22,743 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Resolved ResourceManager address, beginning registration
2018-12-10 12:20:22,743 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-10 12:20:22,814 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Successful registration at resource manager akka.tcp://flink@localhost:6123/user/resourcemanager under registration id ba9dd638db7ebccde63a3e0df420a990.

On 12/10/2018 12:14 PM, Piotr Nowojski wrote:
> Hi,
|
Hey,

Is that the whole Task Manager log? Have you checked for memory issues on both the Task Managers and the Job Manager, such as out-of-memory errors or long GC pauses, as I suggested in my first email?

After you rule out memory issues, you could capture a couple of thread dumps (`kill -3 JVM_PID` or `jstack JVM_PID`) and check whether any thread is stuck in your code.

Another potential issue: are you sure that you have a healthy network between the nodes? No packet loss, low ping, etc.?

Piotrek
|