Hi
In both Flink 1.9.2 and 1.10 we have struggled with random deserialization and index-out-of-range exceptions in one of our jobs. We also get out-of-memory exceptions. We have now identified latency tracking together with broadcast state as the cause. When we run integration tests locally we don't see any problem; it only fails when running on the cluster. We have concluded that latency tracking packets sent over the broadcast stream corrupt the data stream and cause the exceptions. We are preparing a simple project on GitHub to reproduce the problem so the underlying issue can be solved. Has anyone else seen this kind of problem?
Best regards
Lasse Nedergaard
Hi Lasse
I have never encountered this problem before. Could you share some exception stack traces so that we can take a look? A simple project to reproduce the problem would also be helpful.
Best
Yun Tang
From: Lasse Nedergaard <[hidden email]>
Sent: Tuesday, March 31, 2020 19:10
To: user <[hidden email]>
Subject: Latency tracking together with broadcast state can cause job failure
Hi
I have attached a simple project with a test that reproduces the problem. The usual failure is a mixed-up string, but you can also get an EOF exception. Please let me know if you have any questions about the project.
Best regards
Lasse Nedergaard
On 1 Apr 2020, at 09:15, Yun Tang <[hidden email]> wrote:
Attachment: Telematics2-feature-flink-1.10-latency-tracking-broken (97K)
Hey Lasse,
has the problem been resolved?
(I'm also responding to this to make sure the thread gets attention again :) )
Best,
Robert
On Wed, Apr 1, 2020 at 10:03 PM Lasse Nedergaard <[hidden email]> wrote:
Hi Lasse
Really sorry for missing your reply. I'll run your project and look for the root cause during my daytime. And thanks to [hidden email] for the kind reminder.
Best
Yun Tang
From: Robert Metzger <[hidden email]>
Sent: Tuesday, April 21, 2020 20:01
To: Lasse Nedergaard <[hidden email]>
Cc: Yun Tang <[hidden email]>; user <[hidden email]>
Subject: Re: Latency tracking together with broadcast state can cause job failure
Hi Lasse
After debugging locally, I believe this is a bug in Flink (even in the latest version). However, the bug appears to originate in the network stack, which I am not very familiar with, so the root cause is not easy to find directly. After discussing with the Flink network developers, we decided to first create FLINK-17322 [1] to track this problem, and the relevant owner will take a look.
Thank you very much for reporting this bug.
Best
Yun Tang
From: Yun Tang <[hidden email]>
Sent: Wednesday, April 22, 2020 1:43
To: Lasse Nedergaard <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: Latency tracking together with broadcast state can cause job failure
Hi Yun
Thanks for looking into it and forwarding it to the right place.
Best regards
Lasse Nedergaard
On 22 Apr 2020, at 11:06, Yun Tang <[hidden email]> wrote:
Hi Lasse,
your reported issue [1] will be fixed in the next release of 1.10 and in the upcoming 1.11.
Thank you for your detailed report.
On Wed, Apr 22, 2020 at 12:54 PM Lasse Nedergaard <[hidden email]> wrote:
--
Arvid Heise | Senior Java Developer
Ververica GmbH