Re: Issue with single job yarn flink cluster HA

Ken Krugler
Hi Dinesh,

Did updating to Flink 1.10 resolve the issue?

Thanks,

— Ken

Hi Andrey,
Sure, we will try Flink 1.10 to see whether the HA issues we are facing are fixed, and will update this thread.

Thanks,
Dinesh

On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

Thanks for sharing the logs. There have been a couple of HA fixes since 1.7, e.g. [1] and [2].
I would suggest trying Flink 1.10.
If the problem persists, could you also find the logs of the failed Job Manager before the failover?

Best,
Andrey


On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <[hidden email]> wrote:
Hi Yang,
I am attaching one full jobmanager log for a job which I reran today. This is a job that tries to read from a savepoint.
The same error message, "leader election ongoing", is displayed, and it stays that way even after 30 minutes. If I leave the job without a yarn kill, it stays in that state forever.
Based on your suggestions so far, I guess it might be some ZooKeeper problem. If that is the case, what can I look for in ZooKeeper to figure out the issue?

Thanks,
Dinesh


[snip]

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr