TaskManager crashes with PageRank algorithm in Gelly


santoshg
Folks,

We are running a simple PageRank algorithm in Gelly with about 1M edges, and
we are seeing that one of the TaskManagers just crashes. We suspect a
configuration issue, because each TaskManager machine has a total of 136 GB
of memory and we have 8 of them, so the total memory should be more than enough.

Here is an excerpt from the TaskManager log:

2018-02-21 17:52:24,610 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
2018-02-21 17:52:24,626 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-02-21 17:52:24,626 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  OS current user: flink-user
2018-02-21 17:52:24,626 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current Hadoop/Kerberos user: <no hadoop dependency found>
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b14
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 25400 MiBytes
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  No Hadoop Dependency available
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms25395M
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx25395M
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+UseG1GC
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+PrintSafepointStatistics
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink-user/flink-1.4.0/log/flink-flink-user-taskmanager-0-ip-10-10-1-59.log
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink-user/flink-1.4.0/conf/log4j.properties
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink-user/flink-1.4.0/conf/logback.xml
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink-user/flink-1.4.0/lib/flink-gelly_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-gelly-scala_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-presto-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/log4j-1.2.17.jar:/home/flink-user/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/home/flink-user/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
2018-02-21 17:52:24,628 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
2018-02-21 17:52:24,629 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-02-21 17:52:24,667 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 768000
2018-02-21 17:52:24,728 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, 10.10.1.242
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.mb, 131072
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.mb, 139264
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 64
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.preallocate, false
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.off-heap, true
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.fraction, 0.8
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.min, 4294967296
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.max, 12884901888
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 512
2018-02-21 17:52:24,748 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.port, 8081
2018-02-21 17:52:24,748 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.tmp.dirs, /home/flink-user/flink-tmp-dir
2018-02-21 17:52:24,748 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: env.java.home, /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: env.java.opts, -XX:+UseG1GC -XX:+PrintSafepointStatistics -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.framesize, 201326591b
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.log.lifecycle.events, true
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.client.timeout, 300 s
2018-02-21 17:52:24,849 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-02-21 17:52:24,965 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,188 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,347 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-02-21 17:52:25,348 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-02-21 17:52:25,350 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved new target address /10.10.1.242:6123.
2018-02-21 17:52:25,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'ip-10-10-1-59' (10.10.1.59) for communication.
2018-02-21 17:52:25,405 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager
2018-02-21 17:52:25,406 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at ip-10-10-1-59:40949.
2018-02-21 17:52:25,408 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to start actor system at ip-10-10-1-59:40949
2018-02-21 17:52:26,493 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-02-21 17:52:26,553 INFO  akka.remote.Remoting                                          - Starting remoting
2018-02-21 17:52:27,021 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,022 INFO  akka.remote.Remoting                                          - Remoting now listens on addresses: [akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,029 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Actor system started at akka.tcp://flink@ip-10-10-1-59:40949
2018-02-21 17:52:27,067 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-02-21 17:52:27,084 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor


---------------------

Here is the dump from the hs_err_pid file:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2651), pid=2439, tid=0x00007fc4b7efe700
#
# JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

---------------  T H R E A D  ---------------

Current thread (0x00007fb5afff8260):


--------------

In the JobManager we see the following:

2018-02-21 17:55:52,380 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) if no longer possible.
2018-02-21 17:55:52,380 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) switched from state FAILING to FAILED.
java.lang.Exception: The data preparation for task 'Reduce (Sum)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:466)
        at org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:145)
        at org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:93)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
        at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1095)
        at org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:95)
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460)
        ... 5 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.


-------------

Here are the TaskManager settings:

# The heap size for the TaskManager JVM

taskmanager.heap.mb: 139264


# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.

taskmanager.numberOfTaskSlots: 64

# Specify whether TaskManager memory should be allocated when starting up (true) or when
# memory is required in the memory manager (false)
# Important Note: For pure streaming setups, we highly recommend to set this value to `false`
# as the default state backends currently do not use the managed memory.

taskmanager.memory.preallocate: false
taskmanager.memory.off-heap: true
taskmanager.memory.fraction: 0.8

#taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 4294967296
taskmanager.network.memory.max: 12884901888

#taskmanager.network.numberOfBuffers: 8192
#taskmanager.debug.memory.startLogThread: true
#taskmanager.debug.memory.logIntervalMs: 500

# The parallelism used for programs that did not specify any other parallelism.

parallelism.default: 512
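
A quick arithmetic check of these settings, as a sketch under assumptions: with 8 TaskManagers (as stated in the post) and 64 slots each, the cluster offers 512 slots, which matches parallelism.default. The buffer estimate uses the rule of thumb from the Flink configuration documentation (slots per TaskManager squared, times the number of TaskManagers, times 4) and assumes the default 32 KiB network buffer size; neither assumption is confirmed by the post itself.

```python
# Assumed rule of thumb for the number of network buffers:
#   buffers = slots_per_tm ** 2 * num_task_managers * 4
SEGMENT_SIZE_BYTES = 32 * 1024  # assumed default network buffer size

slots_per_tm = 64       # taskmanager.numberOfTaskSlots
num_task_managers = 8   # number of TaskManagers stated in the post

# Total slots available in the cluster.
total_slots = slots_per_tm * num_task_managers

# Estimated buffer count and the memory those buffers would occupy.
buffers = slots_per_tm ** 2 * num_task_managers * 4
buffer_bytes = buffers * SEGMENT_SIZE_BYTES

print(total_slots)
print(buffers)
print(buffer_bytes)
```

Under these assumptions the estimate comes to exactly the configured taskmanager.network.memory.min of 4294967296 bytes, so the network settings look consistent with the slot count.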

-----------

So, what are we doing wrong here?





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: TaskManager crashes with PageRank algorithm in Gelly

santoshg
An update: I was able to overcome these issues by setting the preallocate
flag (taskmanager.memory.preallocate) to true. I am not sure why this fixes
the problem and need to dig a bit deeper.




Re: TaskManager crashes with PageRank algorithm in Gelly

Nico Kruber
Hi,
without knowing Gelly here, maybe it has something to do with cleaning
up the allocated memory, as mentioned in [1]:

taskmanager.memory.preallocate: Can be either of true or false.
Specifies whether task managers should allocate all managed memory when
starting up. (DEFAULT: false). When taskmanager.memory.off-heap is set
to true, then it is advised that this configuration is also set to true.
If this configuration is set to false cleaning up of the allocated
offheap memory happens only when the configured JVM parameter
MaxDirectMemorySize is reached by triggering a full GC. Note: For
streaming setups, we highly recommend to set this value to false as the
core state backends currently do not use the managed memory.


Nico

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#managed-memory
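
Regarding the documentation quoted above: the TaskManager log in the first post shows the JVM was started with -XX:MaxDirectMemorySize=8388607T. The following is a minimal sketch of why a full GC triggered by reaching that limit would probably never happen on a 136 GB machine. The helper parse_jvm_size is hypothetical and written only for this illustration; it is not part of Flink or the JVM.

```python
# Binary multipliers for JVM-style size suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_jvm_size(text):
    """Convert a JVM size string such as '25395M' or '8388607T' to bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)  # no suffix: plain byte count


# Value taken from the TaskManager log (-XX:MaxDirectMemorySize=8388607T).
max_direct = parse_jvm_size("8388607T")

# 136 GB per machine, as stated in the original post.
physical_ram = 136 * 1024**3

# The limit can only be hit if it is no larger than the physical memory.
limit_reachable = max_direct <= physical_ram

print(max_direct)
print(limit_reachable)
```

The limit is within 1 TiB of 2^63 bytes, far beyond physical memory, so with preallocation disabled the machine would likely run out of native memory long before the limit forces a cleanup. That is consistent with the mmap failure in the hs_err file, although the thread does not confirm this as the root cause.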


Re: TaskManager crashes with PageRank algorithm in Gelly

Greg Hogan
In reply to this post by santoshg
Termination of the TaskManager by the Linux OOM killer indicates an overallocation of memory, and you have set "taskmanager.heap.mb: 139264" on machines with 136 GB.

Even though you were able to (temporarily?) resolve the issue by enabling preallocation, you may see degraded performance if system processes (e.g. prefetch) have no memory to work with.

https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#managed-memory

Greg
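
To make the overallocation concrete, here is a rough sketch of how the configured 139264 MB appears to be divided. The formula is an assumption about the Flink 1.4 startup logic, chosen because it reproduces the -Xms25395M/-Xmx25395M values in the posted TaskManager log; the network fraction of 0.1 is the assumed default, since that key is commented out in the posted settings. The function name is illustrative only.

```python
MIB = 1024 * 1024


def split_taskmanager_memory(total_mb, network_fraction, network_min_bytes,
                             network_max_bytes, managed_fraction):
    """Return (network_mb, managed_offheap_mb, jvm_heap_mb) for one TaskManager."""
    # Network buffers: a fraction of the total, clamped to [min, max].
    network_bytes = int(total_mb * MIB * network_fraction)
    network_bytes = max(network_min_bytes, min(network_bytes, network_max_bytes))
    network_mb = network_bytes // MIB

    # With off-heap managed memory enabled, the managed fraction is taken
    # out of what remains, and only the rest becomes JVM heap.
    remaining_mb = total_mb - network_mb
    managed_mb = round(remaining_mb * managed_fraction)
    heap_mb = remaining_mb - managed_mb
    return network_mb, managed_mb, heap_mb


network_mb, managed_mb, heap_mb = split_taskmanager_memory(
    total_mb=139264,                  # taskmanager.heap.mb
    network_fraction=0.1,             # assumed default
    network_min_bytes=4294967296,     # taskmanager.network.memory.min
    network_max_bytes=12884901888,    # taskmanager.network.memory.max
    managed_fraction=0.8,             # taskmanager.memory.fraction
)

print(network_mb, managed_mb, heap_mb)
print((network_mb + managed_mb + heap_mb) / 1024)  # total in GiB
```

Under these assumptions the three parts add up to 136 GiB, that is, the entire memory of the machine, which would leave nothing for JVM overhead (thread stacks, metaspace, GC structures) or for the operating system.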

