"Memory ran out" error when running connected components


Arkay
Hi to all,

I’m aware there are a few threads on this, but I haven’t been able to solve an issue I am seeing and hope someone can help. I’m trying to run the following:

val connectedNetwork = new org.apache.flink.api.scala.DataSet[Vertex[Long, Long]](
  Graph.fromTuple2DataSet(inputEdges, vertexInitialiser, env)
    .run(new ConnectedComponents[Long, NullValue](100)))

And hitting the error:

java.lang.RuntimeException: Memory ran out. numPartitions: 32 minPartition: 8 maxPartition: 8 number of overflow segments: 122 bucketSize: 206 Overall memory: 19365888 Partition memory: 8388608
         at org.apache.flink.runtime.operators.hash.CompactingHashTable.getNextBuffer(CompactingHashTable.java:753)
         at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertBucketEntryFromStart(CompactingHashTable.java:546)
         at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:423)
         at org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
         at org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
         at org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
         at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
         at java.lang.Thread.run(Unknown Source)

I’m running Flink 1.0.3 on Windows 10 using start-local.bat. I have Xmx set to 6500MB, 8 workers, parallelism 8, and other memory settings left at default.

The inputEdges dataset contains 141MB of Long,Long pairs (around 6 million edges). ParentID is unique and always negative; ChildID is non-unique and always positive (simulating a bipartite graph).

An example few rows:
-91498683401,1738
-135344401,5370
-100260517801,7970
-154352186001,12311
-160265532002,12826

The vast majority of the childIds are actually unique, and the most popular ID only occurs 10 times.

VertexInitialiser just sets the vertex value to the id.
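Roughly speaking it’s just a MapFunction from vertex ID to initial value, along these lines (a simplified sketch rather than the exact class I’m using):

import org.apache.flink.api.common.functions.MapFunction

// Give every vertex its own ID as the starting value, the usual
// starting point for connected components.
val vertexInitialiser = new MapFunction[Long, Long] {
  override def map(id: Long): Long = id
}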

Hopefully this is just a memory setting I’m not seeing for the hash table, as it dies almost instantly; I don’t think it gets very far into the dataset. I understand that the CompactingHashTable cannot spill, but I’d be surprised if it needed to at these low volumes.

Many thanks for any help!

Rob

Re: "Memory ran out" error when running connected components

Vasiliki Kalavri
Hi Rob,


On 13 May 2016 at 11:22, Arkay <[hidden email]> wrote:
I’m running Flink 1.0.3 on Windows 10 using start-local.bat. I have Xmx set to 6500MB, 8 workers, parallelism 8 and other memory settings left at default.

The start-local script will start a single JobManager and TaskManager. What do you mean by 8 workers? Have you set the numberOfTaskSlots to 8? To give all available memory to your TaskManager, you should set the "taskmanager.heap.mb" configuration option in flink-conf.yaml. Can you open the Flink dashboard at http://localhost:8081/ and check the configuration of your taskmanager?

Cheers,
-Vasia.




Re: "Memory ran out" error when running connected components

Arkay
Thanks Vasia,

Apologies, yes, by workers I mean I have set taskmanager.numberOfTaskSlots: 8 and parallelism.default: 8 in flink-conf.yaml. I have also set taskmanager.heap.mb: 6500.
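For reference, the relevant lines in my flink-conf.yaml are:

taskmanager.heap.mb: 6500
taskmanager.numberOfTaskSlots: 8
parallelism.default: 8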

In the dashboard it is showing free memory as 5.64GB and Flink Managed Memory as 3.90GB.

Thanks,
Rob

Re: "Memory ran out" error when running connected components

Vasiliki Kalavri
Thanks for checking, Rob! I don't see any reason for the job to fail with this configuration and input size.
I have no experience running Flink on windows though, so I might be missing something. Do you get a similar error with smaller inputs?

-Vasia.



Re: "Memory ran out" error when running connected components

Arkay
Hi Vasia,

It seems to work OK up to about 50MB of input, and dies after that point. If I disable just this connected components step, the rest of my program is happy with the full 1.5GB test dataset. It seems to be specifically limited to GraphAlgorithms in my case.

Do you know what the units are when it says Partition memory: 8388608? If that is bytes, then it sounds like it's using around 256MB per hash table of 32 partitions (which is then multiplied by the number of task slots, I guess). Do you know if this number can be configured? Perhaps the Windows version of the JVM is defaulting it to a lower value than on Linux?

Edit: I've noticed that halving the parallelism to 4 more than doubles the partition memory to 18513920, but it still fails instantly.

Thanks,
Rob

Re: "Memory ran out" error when running connected components

Vasiliki Kalavri


On 13 May 2016 at 14:28, Arkay <[hidden email]> wrote:
Hi Vasia,

It seems to work OK up to about 50MB of input, and dies after that point. If I disable just this connected components step, the rest of my program is happy with the full 1.5GB test dataset. It seems to be specifically limited to GraphAlgorithms in my case.

So your program has other steps before/after the connected components algorithm?
Could it be that you have some expensive operation that competes for memory with the hash table?

 

Do you know what the units are when it says Partition memory: 8388608? If that is bytes, then it sounds like it's using around 256MB per hash table of 32 partitions (which is then multiplied by the number of task slots, I guess).

Yes, that's bytes.

 
Do you know if this number can be configured? Perhaps the Windows version of the JVM is defaulting it to a lower value than on Linux?

By default, the hash table uses Flink's managed memory. That's 3.0GB in your case (0.7 of the total memory by default).
You can change this fraction by setting "taskmanager.memory.fraction" in the configuration. See [1] for other managed memory options.
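For example, to give the managed memory a larger share you could add something like this to flink-conf.yaml (0.8 is only an illustrative value, not a recommendation):

taskmanager.memory.fraction: 0.8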

Hope this helps!
-Vasia.


 



Re: "Memory ran out" error when running connected components

Arkay
Thanks for the link. I had experimented with those options, apart from taskmanager.memory.off-heap: true. It turns out that allows it to run through happily! I don't know if that is a peculiarity of a Windows JVM, as I understand that setting is purely an efficiency improvement?
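For anyone else hitting this, the setting in question is a single line in flink-conf.yaml:

taskmanager.memory.off-heap: true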

For your first question, yes, I have a number of steps that get scheduled around the same time in the job; it's not really avoidable unless there are optimizer hints to tell the system to only run certain steps on their own? I will try cutting the rest of the program out as a test, however.

Thanks very much for your help with this, and all your excellent work on Flink and Gelly :)

Rob

Re: "Memory ran out" error when running connected components

Vasiliki Kalavri
Hey Rob,

On 13 May 2016 at 15:45, Arkay <[hidden email]> wrote:
Thanks for the link. I had experimented with those options, apart from taskmanager.memory.off-heap: true. It turns out that allows it to run through happily! I don't know if that is a peculiarity of a Windows JVM, as I understand that setting is purely an efficiency improvement?

Great to hear that you solved your problem!
I'm not sure whether it's a Windows peculiarity; maybe someone else could clear this up.

 

For your first question, yes, I have a number of steps that get scheduled around the same time in the job; it's not really avoidable unless there are optimizer hints to tell the system to only run certain steps on their own?

You could try setting the execution mode to BATCH/BATCH_FORCED with "env.getConfig.setExecutionMode()".
It is typically more expensive than the default pipelined mode, but it guarantees that no successive operations are run concurrently.
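A minimal sketch of what that would look like in Scala:

import org.apache.flink.api.common.ExecutionMode
import org.apache.flink.api.scala.ExecutionEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment
// Run successive operators one after the other instead of pipelined,
// so they don't compete for managed memory at the same time.
env.getConfig.setExecutionMode(ExecutionMode.BATCH_FORCED)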

 
I will try cutting the rest of the program out as a test however.

Thanks very much for your help with this, and all your excellent work on Flink and Gelly :)

:))
Let us know if you run into any more problems or have questions.

​Cheers,
-V.​

