(DEPRECATED) Apache Flink User Mailing List archive.

Flink job server with HA

Classic

List

Threaded

7 messages Options

Boris Lublinsky

Flink job server with HA

I am trying to experiment with Flink Job server with HA and I am noticing, that in this case

method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)

As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.

Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

Xintong Song

Re: Flink job server with HA

Hi Boris,

I think what you described that putJobGraph is not invoked in Flink job cluster is by design and should not cause a failure of job recovering. For a Flink job cluster, there is only one job graph to execute. Instead of uploading job graph to an already running cluster (like in a session cluster), the job graph in a Flink job cluster is uploaded before the cluster is started, together with the Flink framework jars. Please refer to MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem. For example, can you explain what are the detailed operation steps do you perform when you say "trying to restart a Job Master".

Thank you~

Xintong Song

On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <[hidden email]> wrote:

I am trying to experiment with Flink Job server with HA and I am noticing, that in this case
method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)
As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.
Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

Boris Lublinsky

Re: Flink job server with HA

Thanks,

Thats what I thought initially.

The issue is that because of this, during restart, it does not know which job was running before (it is obtained from submitted job graph store).

Because this is empty, there is no restarted jobs and the cluster does not even try to restore checkpoints.

I can see that checkpoints are stored correctly, but they are never accessed.

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:23 PM, Xintong Song <[hidden email]> wrote:

Hi Boris,

I think what you described that putJobGraph is not invoked in Flink job cluster is by design and should not cause a failure of job recovering. For a Flink job cluster, there is only one job graph to execute. Instead of uploading job graph to an already running cluster (like in a session cluster), the job graph in a Flink job cluster is uploaded before the cluster is started, together with the Flink framework jars. Please refer to MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem. For example, can you explain what are the detailed operation steps do you perform when you say "trying to restart a Job Master".

Thank you~
Xintong Song

On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <[hidden email]> wrote:
I am trying to experiment with Flink Job server with HA and I am noticing, that in this case
method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)
As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.
Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

Xintong Song

Re: Flink job server with HA

So here are my questions:

1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?

2. How do you trigger "restart a Job Master"?

Thank you~

Xintong Song

On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <[hidden email]> wrote:

Thanks,
Thats what I thought initially.
The issue is that because of this, during restart, it does not know which job was running before (it is obtained from submitted job graph store).
Because this is empty, there is no restarted jobs and the cluster does not even try to restore checkpoints.
I can see that checkpoints are stored correctly, but they are never accessed.

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:23 PM, Xintong Song <[hidden email]> wrote:

Hi Boris,

I think what you described that putJobGraph is not invoked in Flink job cluster is by design and should not cause a failure of job recovering. For a Flink job cluster, there is only one job graph to execute. Instead of uploading job graph to an already running cluster (like in a session cluster), the job graph in a Flink job cluster is uploaded before the cluster is started, together with the Flink framework jars. Please refer to MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem. For example, can you explain what are the detailed operation steps do you perform when you say "trying to restart a Job Master".

Thank you~
Xintong Song

On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <[hidden email]> wrote:
I am trying to experiment with Flink Job server with HA and I am noticing, that in this case
method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)
As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.
Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

Boris Lublinsky

Re: Flink job server with HA

I am running on k8

Job master runs as a deployment of 1, so just killing a pod restarts it

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:46 PM, Xintong Song <[hidden email]> wrote:

So here are my questions:
1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
2. How do you trigger "restart a Job Master"?

Thank you~
Xintong Song

On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <[hidden email]> wrote:
Thanks,
Thats what I thought initially.
The issue is that because of this, during restart, it does not know which job was running before (it is obtained from submitted job graph store).
Because this is empty, there is no restarted jobs and the cluster does not even try to restore checkpoints.
I can see that checkpoints are stored correctly, but they are never accessed.

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:23 PM, Xintong Song <[hidden email]> wrote:

Hi Boris,

I think what you described that putJobGraph is not invoked in Flink job cluster is by design and should not cause a failure of job recovering. For a Flink job cluster, there is only one job graph to execute. Instead of uploading job graph to an already running cluster (like in a session cluster), the job graph in a Flink job cluster is uploaded before the cluster is started, together with the Flink framework jars. Please refer to MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem. For example, can you explain what are the detailed operation steps do you perform when you say "trying to restart a Job Master".

Thank you~
Xintong Song

On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <[hidden email]> wrote:
I am trying to experiment with Flink Job server with HA and I am noticing, that in this case
method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)
As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.
Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

Xintong Song

Re: Flink job server with HA

If that is the case, then I would suggest you to check the following two things:

1. Is the HA mode configured properly in Flink configuration? There should be a config option "high-availability" in your flink-conf.yarml. If not configured, the default value would be "NONE".

2. It "ClassPathJobGraphRetriever#retrieveJobGraph" actually invoked, and is there any exceptions thrown from it. This is to verify whether the correct code path for job cluster is invoked.

Thank you~

Xintong Song

On Tue, Jun 4, 2019 at 10:48 AM Boris Lublinsky <[hidden email]> wrote:

I am running on k8
Job master runs as a deployment of 1, so just killing a pod restarts it

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:46 PM, Xintong Song <[hidden email]> wrote:

So here are my questions:
1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
2. How do you trigger "restart a Job Master"?

Thank you~
Xintong Song

On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <[hidden email]> wrote:
Thanks,
Thats what I thought initially.
The issue is that because of this, during restart, it does not know which job was running before (it is obtained from submitted job graph store).
Because this is empty, there is no restarted jobs and the cluster does not even try to restore checkpoints.
I can see that checkpoints are stored correctly, but they are never accessed.

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:23 PM, Xintong Song <[hidden email]> wrote:

Hi Boris,

I think what you described that putJobGraph is not invoked in Flink job cluster is by design and should not cause a failure of job recovering. For a Flink job cluster, there is only one job graph to execute. Instead of uploading job graph to an already running cluster (like in a session cluster), the job graph in a Flink job cluster is uploaded before the cluster is started, together with the Flink framework jars. Please refer to MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem. For example, can you explain what are the detailed operation steps do you perform when you say "trying to restart a Job Master".

Thank you~
Xintong Song

On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <[hidden email]> wrote:
I am trying to experiment with Flink Job server with HA and I am noticing, that in this case
method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)
As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.
Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

Boris Lublinsky

Re: Flink job server with HA

And it works now.

My mistake

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 10:18 PM, Xintong Song <[hidden email]> wrote:

If that is the case, then I would suggest you to check the following two things:
1. Is the HA mode configured properly in Flink configuration? There should be a config option "high-availability" in your flink-conf.yarml. If not configured, the default value would be "NONE".
2. It "ClassPathJobGraphRetriever#retrieveJobGraph" actually invoked, and is there any exceptions thrown from it. This is to verify whether the correct code path for job cluster is invoked.

Thank you~
Xintong Song

On Tue, Jun 4, 2019 at 10:48 AM Boris Lublinsky <[hidden email]> wrote:
I am running on k8
Job master runs as a deployment of 1, so just killing a pod restarts it

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:46 PM, Xintong Song <[hidden email]> wrote:

So here are my questions:
1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
2. How do you trigger "restart a Job Master"?

Thank you~
Xintong Song

On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <[hidden email]> wrote:
Thanks,
Thats what I thought initially.
The issue is that because of this, during restart, it does not know which job was running before (it is obtained from submitted job graph store).
Because this is empty, there is no restarted jobs and the cluster does not even try to restore checkpoints.
I can see that checkpoints are stored correctly, but they are never accessed.

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Jun 3, 2019, at 9:23 PM, Xintong Song <[hidden email]> wrote:

Hi Boris,

I think what you described that putJobGraph is not invoked in Flink job cluster is by design and should not cause a failure of job recovering. For a Flink job cluster, there is only one job graph to execute. Instead of uploading job graph to an already running cluster (like in a session cluster), the job graph in a Flink job cluster is uploaded before the cluster is started, together with the Flink framework jars. Please refer to MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem. For example, can you explain what are the detailed operation steps do you perform when you say "trying to restart a Job Master".

Thank you~
Xintong Song

On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <[hidden email]> wrote:
I am trying to experiment with Flink Job server with HA and I am noticing, that in this case
method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I can see that it is invoked in the case of session cluster when a job is added)
As a result, when I am trying to restart a Job Master, it finds no running jobs and is not trying to restore it.
Am I missing something?

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/