(DEPRECATED) Apache Flink User Mailing List archive.

Re: Issue with single job yarn flink cluster HA

Ken Krugler

Hi Dinesh,

Did updating to Flink 1.10 resolve the issue?

Thanks,

— Ken

Hi Andrey,
Sure, we will try Flink 1.10 to see whether the HA issues we are facing are fixed, and will update this thread.

Thanks,
Dinesh

On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Dinesh,

Thanks for sharing the logs. There have been a couple of HA fixes since 1.7, e.g. [1] and [2].
I would suggest trying Flink 1.10.
If the problem persists, could you also find the logs of the failed Job Manager before the failover?

Best,
Andrey

[1] https://jira.apache.org/jira/browse/FLINK-14316
[2] https://jira.apache.org/jira/browse/FLINK-11843

On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <[hidden email]> wrote:
Hi Yang,
I am attaching one full jobmanager log for a job which I reran today. This is a job that tries to read from a savepoint.
The same error message, "leader election ongoing", is displayed, and it persists even after 30 minutes. If I leave the job running without a yarn kill, it stays in that state indefinitely.
Based on your suggestions so far, I guess it might be a ZooKeeper problem. If that is the case, what should I look for in ZooKeeper to figure out the issue?

Thanks,
Dinesh

[snip]

--------------------------

Ken Krugler

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr