I'm running flink-1.1.4-bin-hadoop27-scala_2.11 and I'm running into an issue where after some period of time (measured in 1 - 3 hours) the JobManager gets an NPE and shuts itself down. The failure is at JobManager$$updateAccumulators$1.apply(JobManager.scala:1790). I'm using a custom accumulator[1], but can't tell from the JobManager code whether the issue is in my Accumulator, or is a bug in the JobManager. [1] https://github.com/NationalSecurityAgency/timely/blob/master/analytics/src/main/java/timely/analytics/flink/SortedStringAccumulator.java |
Can you pastebin the complete stack trace for the NPE ? Thanks On Thu, Jan 19, 2017 at 8:57 AM, Dave Marion <[hidden email]> wrote:
|
Noticed I didn't cc the user list. ---------- Original Message ---------- |
Hi!
It seems that the accumulator behaves in a non-standard way, but the JobManager should also catch that (log a warning or debug message) and simply continue (not crash). I'll try to add a patch that the JobManager tolerates these kinds of issues in the accumulators. Stephan On Thu, Jan 19, 2017 at 7:26 PM, Dave Marion <[hidden email]> wrote:
|
I opened this issue: https://issues.apache.org/jira/browse/FLINK-5585 Assuming the bug is what I think it is (cannot be 100% sure from just the small stack trace sample) it should be fixed soon... On Fri, Jan 20, 2017 at 10:29 AM, Stephan Ewen <[hidden email]> wrote:
|
In reply to this post by Stephan Ewen
Stephan,
Thanks for looking at this. Could you elaborate on the misbehavior in the accumulator? I'd like to fix it if it's incorrect. Dave On January 20, 2017 at 4:29 AM Stephan Ewen <[hidden email]> wrote: |
Hi!
My current assumption is that there is an accumulator that cannot be serialized. The SortedStringAccumulator looks fine at a first glance, but are there other accumulators involved? Do you see a message like that one in the log of one of the TaskManagers "Failed to serialize accumulators for task." with an exception stack trace? Stephan On Fri, Jan 20, 2017 at 2:10 PM, Dave Marion <[hidden email]> wrote:
|
I do see that message in one of the task manager logs 20ms before the NPE in the JobManager. Looking in that log, there is a ConcurrentModificationException in TreeMap, which my accumulator uses. I'll track this down, thanks for the pointer.
On January 20, 2017 at 8:27 AM Stephan Ewen <[hidden email]> wrote: |
Fixing my accumulator did the trick. I should note that the JobManager did not fail when I ran this previously against Flink 1.1.3. Thanks for the help!
Dave On January 20, 2017 at 8:45 AM Dave Marion <[hidden email]> wrote: |
Free forum by Nabble | Edit this page |