Folks,
We are running a simple PageRank algorithm in Gelly with about 1M edges, and we are seeing that one of the TaskManagers just crashes. We suspect a configuration issue, because each TaskManager has a total of 136 GB of memory and we have 8 of these, so the total memory should be more than enough.

Here is an excerpt from the TaskManager log:

2018-02-21 17:52:24,610 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
2018-02-21 17:52:24,626 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-02-21 17:52:24,626 INFO org.apache.flink.runtime.taskmanager.TaskManager - OS current user: flink-user
2018-02-21 17:52:24,626 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current Hadoop/Kerberos user: <no hadoop dependency found>
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b14
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 25400 MiBytes
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - No Hadoop Dependency available
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms25395M
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx25395M
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+PrintSafepointStatistics
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/home/flink-user/flink-1.4.0/log/flink-flink-user-taskmanager-0-ip-10-10-1-59.log
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/home/flink-user/flink-1.4.0/conf/log4j.properties
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/home/flink-user/flink-1.4.0/conf/logback.xml
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /home/flink-user/flink-1.4.0/lib/flink-gelly_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-gelly-scala_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-presto-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/log4j-1.2.17.jar:/home/flink-user/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/home/flink-user/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
2018-02-21 17:52:24,628 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
2018-02-21 17:52:24,629 INFO org.apache.flink.runtime.taskmanager.TaskManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-02-21 17:52:24,667 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 768000
2018-02-21 17:52:24,728 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 10.10.1.242
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 131072
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 139264
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 64
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.off-heap, true
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.fraction, 0.8
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.network.memory.min, 4294967296
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.network.memory.max, 12884901888
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 512
2018-02-21 17:52:24,748 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.port, 8081
2018-02-21 17:52:24,748 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.tmp.dirs, /home/flink-user/flink-tmp-dir
2018-02-21 17:52:24,748 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.opts, -XX:+UseG1GC -XX:+PrintSafepointStatistics -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.framesize, 201326591b
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.log.lifecycle.events, true
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.client.timeout, 300 s
2018-02-21 17:52:24,849 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-02-21 17:52:24,965 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,188 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,347 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-02-21 17:52:25,348 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-02-21 17:52:25,350 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /10.10.1.242:6123.
2018-02-21 17:52:25,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'ip-10-10-1-59' (10.10.1.59) for communication.
2018-02-21 17:52:25,405 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2018-02-21 17:52:25,406 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at ip-10-10-1-59:40949.
2018-02-21 17:52:25,408 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to start actor system at ip-10-10-1-59:40949
2018-02-21 17:52:26,493 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-02-21 17:52:26,553 INFO akka.remote.Remoting - Starting remoting
2018-02-21 17:52:27,021 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,022 INFO akka.remote.Remoting - Remoting now listens on addresses: [akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,029 INFO org.apache.flink.runtime.taskmanager.TaskManager - Actor system started at akka.tcp://flink@ip-10-10-1-59:40949
2018-02-21 17:52:27,067 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-02-21 17:52:27,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor

---------------------

Here is the dump from the hs_err_pid file:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2651), pid=2439, tid=0x00007fc4b7efe700
#
# JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

--------------- T H R E A D ---------------

Current thread (0x00007fb5afff8260):

--------------

In the JobManager we see the following:

2018-02-21 17:55:52,380 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) if no longer possible.
2018-02-21 17:55:52,380 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) switched from state FAILING to FAILED.
java.lang.Exception: The data preparation for task 'Reduce (Sum)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:466)
    at org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:145)
    at org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:93)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
    at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1095)
    at org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:95)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460)
    ... 5 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.

-------------

Here are the TaskManager settings:

# The heap size for the TaskManager JVM
taskmanager.heap.mb: 139264

# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 64

# Specify whether TaskManager memory should be allocated when starting up (true) or when
# memory is required in the memory manager (false)
# Important Note: For pure streaming setups, we highly recommend to set this value to `false`
# as the default state backends currently do not use the managed memory.
taskmanager.memory.preallocate: false
taskmanager.memory.off-heap: true
taskmanager.memory.fraction: 0.8

#taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 4294967296
taskmanager.network.memory.max: 12884901888

#taskmanager.network.numberOfBuffers: 8192
#taskmanager.debug.memory.startLogThread: true
#taskmanager.debug.memory.logIntervalMs: 500

# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 512

-----------

So, what are we doing wrong here?

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
An update: I was able to overcome these issues by setting the preallocate flag (taskmanager.memory.preallocate) to true. I am not sure why this fixes the problem and need to dig a bit deeper.
Hi,
without knowing Gelly here, maybe it has something to do with cleaning up the allocated memory, as mentioned in [1]:

taskmanager.memory.preallocate: Can be either of true or false. Specifies whether task managers should allocate all managed memory when starting up. (DEFAULT: false). When taskmanager.memory.off-heap is set to true, then it is advised that this configuration is also set to true. If this configuration is set to false cleaning up of the allocated offheap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC. Note: For streaming setups, we highly recommend to set this value to false as the core state backends currently do not use the managed memory.

Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#managed-memory

On 22/02/18 19:56, santoshg wrote:
> An update - I was able to overcome these issues by setting the preallocate
> flag to true. Not sure why this fixes the problem. Need to dig a bit deeper.
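
To put the documentation note in numbers, here is a small Python sketch of how far away the MaxDirectMemorySize limit from the posted log is. It only does arithmetic on values that appear in this thread (-XX:MaxDirectMemorySize=8388607T and 136 GB per machine). Reading "T" as TiB and "136 GB" as 136 GiB is an assumption, and so is the conclusion in the comments: it is one plausible reading of why lazily allocated off-heap memory was not cleaned up in time, not a confirmed diagnosis.

```python
# Units (assumption: the JVM suffix "T" means TiB, and 136 GB means 136 GiB).
TIB = 1 << 40
GIB = 1 << 30

# Value taken from the TaskManager log: -XX:MaxDirectMemorySize=8388607T
max_direct_bytes = 8388607 * TIB

# Largest value of a signed 64-bit Java long, for comparison.
long_max = (1 << 63) - 1

# Physical memory per machine as stated in the original post.
physical_ram_bytes = 136 * GIB

# The configured limit is just below Long.MAX_VALUE, i.e. effectively unbounded.
print("limit fits in a Java long:", max_direct_bytes <= long_max)
print("limit / physical RAM:", max_direct_bytes // physical_ram_bytes)

# If cleanup of off-heap memory is only forced once this limit is reached,
# the machine would run out of physical memory long before that happens.
print("RAM exhausted before limit:", physical_ram_bytes < max_direct_bytes)
```

With preallocation enabled the managed memory is reserved once at startup, so this lazy-cleanup path would not come into play in the same way.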
In reply to this post by santoshg
Termination of the TaskManager by the Linux OOM killer indicates an overallocation of memory, and you have set "taskmanager.heap.mb: 139264" on machines with 136 GB.
Even though you were able to (temporarily?) resolve the issue by enabling preallocation, you may see degraded performance if system processes (e.g. prefetch) have no memory to work with.

https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#managed-memory

Greg
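
The overallocation point can be checked with a short Python sketch of the memory budget implied by the posted settings. The sizing rule used here is an assumption about how the Flink 1.4 startup scripts split taskmanager.heap.mb: network buffers take a fraction of the total (0.1, taken from the commented-out setting) clamped to the configured min/max, and with off-heap enabled the remainder is split by taskmanager.memory.fraction. The function name and dictionary keys are hypothetical. The rule is at least consistent with the log, since it reproduces the -Xmx25395M value shown there. "136 GB" is read as 136 GiB.

```python
MB = 1024 * 1024


def taskmanager_memory_budget(total_mb, managed_fraction, net_fraction,
                              net_min_bytes, net_max_bytes, off_heap):
    """Hypothetical helper: split taskmanager.heap.mb into its parts (in MB)."""
    # Network buffers: a fraction of the total, clamped to [min, max].
    network_bytes = int(total_mb * MB * net_fraction)
    network_bytes = max(net_min_bytes, min(network_bytes, net_max_bytes))
    network_mb = network_bytes // MB

    remaining_mb = total_mb - network_mb
    if off_heap:
        # Only (1 - fraction) of the remainder becomes JVM heap; the rest is
        # managed memory allocated outside the heap.
        heap_mb = int(remaining_mb * (1 - managed_fraction))
        managed_off_heap_mb = remaining_mb - heap_mb
    else:
        heap_mb = remaining_mb
        managed_off_heap_mb = 0

    return {
        "heap_mb": heap_mb,
        "managed_off_heap_mb": managed_off_heap_mb,
        "network_mb": network_mb,
    }


# Values from the posted configuration.
budget = taskmanager_memory_budget(
    total_mb=139264,
    managed_fraction=0.8,
    net_fraction=0.1,            # assumed default (setting is commented out)
    net_min_bytes=4294967296,
    net_max_bytes=12884901888,
    off_heap=True,
)

machine_mb = 136 * 1024          # assumption: 136 GB means 136 GiB
headroom_mb = machine_mb - sum(budget.values())

print(budget)                    # heap_mb matches -Xmx25395M from the log
print("headroom for OS and JVM overhead (MB):", headroom_mb)
```

Under these assumptions the three parts add up to the full 139264 MB, which leaves nothing for the operating system, thread stacks, or other JVM overhead on a 136 GiB machine. Lowering taskmanager.heap.mb so that some headroom remains would be the obvious thing to try.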