Hi all,

We're running into a memory management issue when using the iterateWithTermination function. With a small amount of data, everything works perfectly fine. However, as soon as a worker's main memory fills up, nothing seems to happen any more. Once this occurs, any worker whose memory is full has its CPU load drop to a minimum (<5%), while its memory stays full with no apparent garbage collection freeing it. All tasks within the iteration are marked as started, yet none of them does anything measurable. Runs with slightly less data (so that all intermediate results barely fit into main memory) finished within minutes, whereas runs where the data no longer fit ran for days with no results in sight. With fewer workers, or when running the algorithm locally, the issue already appears with less data than the larger cluster (with more combined memory) can still handle.

Our code can be found at [1].

Best regards
Ricarda

[1]: https://github.com/DBDA15/graph-mining/tree/master/graph-mining-flink
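For context, a minimal, self-contained sketch of how iterateWithTermination is used in the Flink Scala API (the data and convergence condition below are invented for illustration and are not taken from the repository): the step function returns the next partial solution together with a termination data set, and the bulk iteration stops once that data set is empty or the maximum number of iterations is reached.

    import org.apache.flink.api.scala._

    object IterateWithTerminationSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Toy input; the real job iterates over triangle/edge data sets.
        val initial: DataSet[Long] = env.fromElements(1L, 2L, 3L)

        val result = initial.iterateWithTermination(10) { current =>
          val next = current.map(_ * 2)             // one bulk-iteration step
          val notConverged = next.filter(_ < 1000L) // empty => stop iterating
          (next, notConverged)
        }

        result.writeAsText("/tmp/iterate-sketch")
        env.execute("iterateWithTermination sketch")
      }
    }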
Hi!

Can you switch to version 0.9.1? That one included some bug fixes, including one or two possible deadlock situations.

Please let us know if that solves the issue, or if the issue persists...

Greetings,
Stephan
Hi,

We tested it with version 0.9.1, but unfortunately the issue persists.

Best
Ricarda
Hey Ricarda,
I will try to reproduce this locally with the data sets in your repo. If you have any hints on how to reproduce it (available memory, which file you were using exactly), feel free to post them. :)

– Ufuk
> On 08 Sep 2015, at 10:41, Ufuk Celebi <[hidden email]> wrote:
>
> I will try to reproduce this locally with the data sets in your repo.

I just saw that the data in the repo is very small. Can you point me to a data set that reproduces the issue?
Thanks for helping me out debugging this, Ricarda! :)

From what I can tell, this is not a deadlock in the network runtime, but a join deadlock within an iteration: https://gist.github.com/uce/3fd5ca45383402ed1b16

@Stephan, Fabian: What's the best way to fix this for good?

@Ricarda: You can work around this by providing JoinHint.REPARTITION_SORT_MERGE as a join hint in the bulk iteration, i.e.

    joinedtriangles = joinedtriangles
      .join(graph, JoinHint.REPARTITION_SORT_MERGE)
      .where { triangle => (triangle.edge3.vertex1, triangle.edge3.vertex2) }
      .equalTo("vertex1", "vertex2") { (triangle, edge) =>
        triangle.edge3.triangleCount = edge.triangleCount
        triangle
      }.name("third triangle edge join")

I saw that you were benchmarking this for a project. The hint will affect the runtime of your program, so you might need to re-run the experiments.

– Ufuk
A quick fix would be to take the first join and give it a "JoinHint.REPARTITION_HASH_BUILD_SECOND" hint.

The second option would be to recognize in the optimizer that a batch exchange cannot happen (when inside an iteration) and instead set the receiver task to break the pipeline (set TempMode.makePipelineBreaker()).
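For illustration, a minimal sketch of how such a hint is attached to a join in the Scala API. The Edge/Triad types, field names, and data below are invented and not taken from the repository; the constant in Flink's JoinHint enum for "repartition and build the hash table from the second input" is REPARTITION_HASH_SECOND.

    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
    import org.apache.flink.api.scala._

    // Hypothetical, simplified stand-ins for the job's edge/triad types.
    case class Edge(vertex1: Long, vertex2: Long, triangleCount: Long)
    case class Triad(vertex1: Long, vertex2: Long, vertex3: Long)

    object JoinHintSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        val graph: DataSet[Edge]   = env.fromElements(Edge(1, 2, 1), Edge(2, 3, 1), Edge(1, 3, 1))
        val triads: DataSet[Triad] = env.fromElements(Triad(1, 2, 3))

        // The hint pins the execution strategy: repartition both inputs and build
        // the hash table from the second input, instead of letting the optimizer
        // pick a plan that can deadlock inside a bulk iteration.
        val joined = triads
          .join(graph, JoinHint.REPARTITION_HASH_SECOND)
          .where { t => (t.vertex1, t.vertex2) }
          .equalTo("vertex1", "vertex2") { (triad, edge) => (triad, edge.triangleCount) }

        joined.writeAsText("/tmp/joinhint-sketch")
        env.execute("join hint sketch")
      }
    }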