HA failing for 1.6.0 job cluster with docker-compose

Tzanko Matev
Dear all,

I am currently experimenting with a Flink 1.6.0 job cluster. The goal is to run a streaming job on K8s. Right now I am using docker-compose to experiment with the job cluster.

I am trying to set up HA with ZooKeeper, but it does not seem to work. My docker-compose file contains the following services:
- ZooKeeper
- Flink job manager
- Flink task manager

The containers are set up as described in the docker-compose documentation, and I have additionally added the necessary HA settings to the configuration file. However, when I kill the job manager container and start it again, the running job does not recover; it always starts from scratch, and I get the following error:

> ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  - Could not retrieve the redirect address.
>
> java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(8c4887f5c13f6d907d82a55d97ac428f, LocalRpcInvocation(requestRestAddress(Time))) sent to akka.tcp://flink@blockprocessor-job-cluster:50000/user/dispatcher because the fencing token is null.

Am I missing something? Is HA implemented for job clusters at all?
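
For reference, here is a rough sketch of the kind of HA settings meant above, together with a small check that the ZooKeeper-related keys are present. The actual configuration was not posted, so the host name `zookeeper:2181` and the path `file:///flink/ha` are placeholders and assumptions; the three keys are the ZooKeeper HA options as I understand them from the Flink documentation.

```python
# Hypothetical excerpt of a flink-conf.yaml; host name and path are placeholders.
SAMPLE_CONF = """\
# ZooKeeper-based high availability
high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper:2181
high-availability.storageDir: file:///flink/ha
"""

# Keys assumed to be needed for ZooKeeper HA.
REQUIRED_HA_KEYS = (
    "high-availability",
    "high-availability.zookeeper.quorum",
    "high-availability.storageDir",
)


def parse_flat_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first ':' only, so values such as 'zookeeper:2181' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """Return the required HA keys that are absent or empty."""
    return [key for key in REQUIRED_HA_KEYS if not conf.get(key)]


conf = parse_flat_conf(SAMPLE_CONF)
missing = missing_ha_keys(conf)
```

With docker-compose, the storage directory would presumably have to be a volume shared between the job manager and task manager containers, so that a restarted job manager can read what the previous one wrote.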

Best wishes,
Tzanko Matev

Re: HA failing for 1.6.0 job cluster with docker-compose

vino yang
Hi Tzanko,

Till is probably the best person to answer this question.

Thanks, vino.

In reply to Tzanko Matev's message of Wed, Sep 19, 2018, 5:47 PM.

Re: HA failing for 1.6.0 job cluster with docker-compose

Till Rohrmann
Hi Tzanko,

In order for the container entrypoint to work properly with HA, the JobID needs to be fixed, that is, kept constant across restarts (see https://issues.apache.org/jira/browse/FLINK-10291). At the moment, we generate a new JobID on every restart of the cluster entrypoint container, so the system cannot find the existing checkpoints.

Making the JobID constant is not a big change, and it should be included in the next bug-fix release.
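
To illustrate why a fresh JobID loses the checkpoints, here is a toy simulation. It is not Flink code: `ha_store`, `run_entrypoint`, and `FIXED_JOB_ID` are hypothetical names, and a plain dictionary stands in for the HA storage. The only point is that recovery looks up state by job ID, so a new ID on each start never finds anything.

```python
import uuid

# Stand-in for the HA storage: maps a job ID to its latest checkpoint number.
ha_store = {}


def run_entrypoint(job_id=None):
    """Simulate one start of the cluster entrypoint.

    Returns (job_id, restored), where restored is the checkpoint found for
    this job ID, or None if the job has to start from scratch.
    """
    if job_id is None:
        # Behaviour described above: a fresh ID is generated on every start.
        job_id = uuid.uuid4().hex
    restored = ha_store.get(job_id)
    # Pretend the job runs for a while and completes one more checkpoint.
    ha_store[job_id] = (restored or 0) + 1
    return job_id, restored


# Two starts with generated IDs: the second start finds nothing to restore.
_, first_start = run_entrypoint()
_, after_restart = run_entrypoint()

# Two starts with a constant (hypothetical) ID: the second start recovers.
FIXED_JOB_ID = "0" * 32
run_entrypoint(FIXED_JOB_ID)
_, restored_fixed = run_entrypoint(FIXED_JOB_ID)
```

In this sketch `after_restart` is `None`, while `restored_fixed` is the checkpoint written by the previous run, which is the behaviour a constant JobID is meant to provide.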

Cheers,
Till

In reply to vino yang's message of Thu, Sep 20, 2018, 10:12 AM.

Re: HA failing for 1.6.0 job cluster with docker-compose

vino yang
Hi all,

Oh, I have taken this ticket and will fix it as soon as possible.

Thanks, vino.

In reply to Till Rohrmann's message of Thu, Sep 20, 2018, 4:35 PM.

Re: HA failing for 1.6.0 job cluster with docker-compose

Tzanko Matev
Hi Vino and Till,

That's great news. Thank you!

Cheers,
Tzanko



In reply to vino yang's message of Thu, Sep 20, 2018, 11:43 AM.