(DEPRECATED) Apache Flink User Mailing List archive.

Few question about upgrade from 1.4 to 1.5 flink ( some very basic )


Vishal Santoshi

Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

1. Has anyone done a rolling upgrade from 1.4 to 1.5? I am not sure we can. It seems that the JM cannot recover jobs and fails with this exception:

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2

2. Does a savepoint (SP) taken on 1.4 resume on 1.5? (Pretty basic, but no harm in asking.)

3. Regarding https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment and taskmanager.numberOfTaskSlots: what would be the desired setting in a standalone (non-Mesos/YARN) cluster?

4. I suspended all jobs and set up 1.5 on the JM (the TMs are still running 1.4). The JM refuses to start with:

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun
.....

Any feedback would be highly appreciated...
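
For readers unfamiliar with the exception in question 1: Java serialization writes a class's serialVersionUID into the stream and throws InvalidClassException when the class loaded locally reports a different value. The sketch below is not Flink code. SerialVersionDemo and its nested classes are hypothetical stand-ins that only show where the two numbers in such a message come from: an explicitly declared UID versus one the JVM derives from the class structure.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical illustration; none of these classes exist in Flink.
class SerialVersionDemo {

    // Declares an explicit serialVersionUID, like the "local class" in the error message.
    static class WithExplicitUid implements Serializable {
        private static final long serialVersionUID = 2L;
        int value;
    }

    // Declares none, so the JVM derives a UID from the class name, fields and methods.
    static class WithDerivedUid implements Serializable {
        int value;
    }

    // Returns the UID that Java serialization writes for this class and checks on read.
    static long uidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("explicit UID: " + uidOf(WithExplicitUid.class));
        System.out.println("derived UID:  " + uidOf(WithDerivedUid.class));
    }
}
```

If a stream was written with one of these values and is read by a class version reporting the other, deserialization fails the way the message above describes. That is consistent with a job graph persisted by one Flink version being recovered by another, although the thread does not confirm the exact cause.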

Fabian Hueske-2

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed how its processes are orchestrated and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint of the running job, start the job from that savepoint on the new cluster, and shut the old job down.

2. Savepoints should be compatible.

3. You can keep the slot configuration as before.

4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not a design goal and nobody made sure that it is possible).

Best, Fabian
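
The savepoint-based upgrade from point 1 can be sketched as three CLI invocations. The snippet below only assembles and prints them, it does not run anything. The class name, paths, and job ID are placeholders, and the savepoint path has to be taken from the output of the first command. It assumes the standard `flink savepoint`, `flink run -s`, and `flink cancel` commands.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; all arguments are placeholders chosen for illustration.
class SavepointUpgradePlan {

    // Builds the commands in the order suggested above:
    // 1) savepoint on the old cluster, 2) resume on the new cluster, 3) cancel the old job.
    static List<List<String>> plan(String oldFlink, String newFlink, String jobId,
                                   String savepointDir, String savepointPath, String jobJar) {
        return Arrays.asList(
                Arrays.asList(oldFlink, "savepoint", jobId, savepointDir),
                Arrays.asList(newFlink, "run", "-d", "-s", savepointPath, jobJar),
                Arrays.asList(oldFlink, "cancel", jobId));
    }

    public static void main(String[] args) {
        List<List<String>> commands = plan(
                "flink-1.4/bin/flink", "flink-1.5/bin/flink", "<jobId>",
                "<savepointDir>", "<savepointPath>", "<job.jar>");
        for (List<String> command : commands) {
            System.out.println(String.join(" ", command));
        }
    }
}
```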


Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Thank you....

One addition:

I do not see watermark (WM) info in the UI (screenshot attached).

Is this a known issue? The same pipeline in our production environment shows the WM (in fact, we never had an issue with watermarks not appearing). Am I missing something?


Attachment: Screen Shot 2018-06-25 at 9.01.17 AM.png (194K)

Chesnay Schepler

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

The watermark issue is known and will be fixed in 1.5.1.


Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Another issue I saw with the Flink CLI:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission, and it does succeed through the UI.

Has there been a regression in the CLI? I could not find any documentation about it.

I have an HA JM setup.


Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

I think all I need to add is

web.port: 8081

rest.port: 8081

to the JM's Flink conf?


Gary Yao-2

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary
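
One way to check the second point is to compare the high-availability entries of the client's and the cluster's configuration files. The following is a minimal sketch, assuming flat `key: value` lines as in flink-conf.yaml. The class name and sample values are hypothetical, and it only reports keys that are set on the cluster side and are missing or different on the client side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Flink.
class HaConfigCheck {

    // Parses flat "key: value" lines, skipping blank lines and # comments.
    static Map<String, String> parse(String conf) {
        Map<String, String> entries = new TreeMap<>();
        for (String line : conf.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int sep = trimmed.indexOf(": ");
            if (sep < 0) {
                continue;
            }
            entries.put(trimmed.substring(0, sep).trim(), trimmed.substring(sep + 2).trim());
        }
        return entries;
    }

    // Returns the cluster's high-availability keys whose client value is missing or different.
    static List<String> mismatches(String clientConf, String clusterConf) {
        Map<String, String> client = parse(clientConf);
        Map<String, String> cluster = parse(clusterConf);
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, String> entry : cluster.entrySet()) {
            boolean haKey = entry.getKey().startsWith("high-availability");
            if (haKey && !entry.getValue().equals(client.get(entry.getKey()))) {
                differing.add(entry.getKey());
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Sample values are placeholders, not taken from any real deployment.
        String cluster = "high-availability: zookeeper\n"
                + "high-availability.zookeeper.quorum: zk-host:2181\n"
                + "high-availability.cluster-id: /cluster-a\n";
        String client = "high-availability: zookeeper\n"
                + "high-availability.zookeeper.quorum: zk-host:2181\n"
                + "high-availability.cluster-id: /cluster-b\n";
        System.out.println("Mismatched keys: " + mismatches(client, cluster));
    }
}
```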


zhangminglei

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Hi, Gary Yao

I once found that the IP address (jobmanager.rpc.address) had changed from 10.208.73.129 to localhost. I think that could cause this issue. What do you think?

Cheers

Minglei


zhangminglei

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

By the way, that was in an HA setup.


Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

OK, a few things:

2018-06-26 13:31:29 INFO CliFrontend:282 - Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)

...

2018-06-26 13:31:31 INFO ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session

2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181

2018-06-26 13:31:31 INFO ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000

2018-06-26 13:31:31 INFO RestClient:119 - Rest client endpoint started.

2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3 replyHeader:: 2,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}

2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3 replyHeader:: 3,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3 replyHeader:: 4,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3 replyHeader:: 5,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3 replyHeader:: 6,60416530560,-101 request:: '/flink_test/da_15/leader/rest_server_lock,T response::

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3 replyHeader:: 7,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3 replyHeader:: 8,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3 replyHeader:: 9,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3 replyHeader:: 10,60416530560,-101 request:: '/flink_test/da_15/leader/dispatcher_lock,T response::

2018-06-26 13:31:31 INFO CliFrontend:914 - Waiting for response...

Waiting for response...

2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:32:01 INFO RestClient:123 - Shutting down rest endpoint.

2018-06-26 13:32:01 INFO RestClient:140 - Rest endpoint shutdown complete.

2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing

2018-06-26 13:32:01 INFO CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting

2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing

2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing

2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,60416530561,0 request:: null response:: null

2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07

2018-06-26 13:32:01 INFO ZooKeeper:684 - Session: 0x35add547801ea07 closed

2018-06-26 13:32:01 INFO ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket

2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.

org.apache.flink.util.FlinkException: Failed to retrieve job list.

at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)

at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)

On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:

By the way, this was in an HA setup.

On Jun 26, 2018, at 5:39 PM, zhangminglei <[hidden email]> wrote:

Hi, Gary Yao

I once noticed that the IP address [jobmanager.rpc.address] had changed from 10.208.73.129 to localhost. I think that could cause this issue. What do you think?

Cheers
Minglei

On Jun 26, 2018, at 4:53 PM, Gary Yao <[hidden email]> wrote:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.
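To make the point above concrete, here is a minimal shell sketch for comparing the high-availability entries of the client's flink-conf.yaml with the cluster's. It assumes every relevant key starts with `high-availability` (as `high-availability.zookeeper.quorum` and `high-availability.cluster-id` do) and begins at the start of a line. The function names and the file paths in the usage comment are made up for illustration.

```shell
# Print the high-availability entries of one flink-conf.yaml,
# with repeated spaces squeezed and the lines sorted.
ha_settings() {
  grep '^high-availability' "$1" | tr -s ' ' | sort
}

# Succeed (exit status 0) only if both files carry the same
# high-availability entries; all other keys are ignored.
same_ha_settings() {
  [ "$(ha_settings "$1")" = "$(ha_settings "$2")" ]
}

# Usage (hypothetical paths):
#   same_ha_settings client/conf/flink-conf.yaml cluster/conf/flink-conf.yaml \
#     || echo "HA settings differ"
```

Comparing only the `high-availability` keys keeps unrelated differences, such as ports or memory settings, out of the result.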

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is

web.port: 8081
rest.port: 8081

to the JM flink conf ?

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission and it does succeed through the UI.

Has there been a regression on CLI... I could not find any documentation around it.

I have a HA JM setup.

On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:

The watermark issue is known and will be fixed in 1.5.1.

On 25.06.2018 15:03, Vishal Santoshi wrote:

Thank you....

One addition

I do not see watermark (WM) info in the UI (attached).

Is this a known issue? The same pipe in our production shows the WM (in fact, we never had an issue with watermarks not appearing). Am I missing something?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:

Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed how the processes are orchestrated and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint of the running job, start from that savepoint on the new cluster, and shut the old job down.

2. Savepoints should be compatible.

3. You can keep the slot configuration as before.

4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not a design goal and nobody made sure that it is possible).
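The procedure from point 1 can be sketched as a dry run that only prints the commands, so they can be reviewed before anything is executed. The `savepoint`, `run -s` and `cancel` actions belong to the Flink CLI, and `start-cluster.sh` is the standalone start script. The installation directories, savepoint target and jar name are placeholders, and the function name is hypothetical. The sketch also assumes the 1.5 cluster runs from its own installation directory next to the 1.4 one.

```shell
# Dry run: prints the commands for review instead of executing them.
# All five arguments are placeholders chosen for this sketch.
print_upgrade_plan() {
  old_home=$1; new_home=$2; job_id=$3; savepoint_dir=$4; job_jar=$5
  # 1. Bring up the separate 1.5 cluster next to the running 1.4 one.
  echo "$new_home/bin/start-cluster.sh"
  # 2. Trigger a savepoint of the running job with the 1.4 client.
  echo "$old_home/bin/flink savepoint $job_id $savepoint_dir"
  # 3. Submit the job to the 1.5 cluster, restoring from the path printed by step 2.
  echo "$new_home/bin/flink run -d -s <savepoint-path-from-step-2> $job_jar"
  # 4. Stop the old job once the new one is running.
  echo "$old_home/bin/flink cancel $job_id"
}

# Example call with made-up paths and the job ID from the log in this thread.
print_upgrade_plan /opt/flink-1.4 /opt/flink-1.5 \
  454cd84a519f3b50e88bcb378d8a1330 hdfs:///savepoints my-job.jar
```

The savepoint path in step 3 is left as a placeholder on purpose, because it is only known once step 2 has actually run.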

Best, Fabian

2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1. Has anyone done a rolling upgrade from 1.4 to 1.5? I am not sure we can. It seems that the JM cannot recover jobs and fails with this exception:

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2

2. Does a savepoint (SP) taken on 1.4 resume on 1.5 (pretty basic, but no harm asking)?

3. https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment Regarding taskmanager.numberOfTaskSlots: what would be the desired setting in a standalone (non-Mesos/YARN) cluster?

4. I suspended all jobs and set up 1.5 on the JM (the TMs are still running 1.4). The JM refuses to start with:

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun
.....

Any feedback would be highly appreciated...

Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

The leader znode is the right one (its content is binary):

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock

wFDakka.tcp://flink@...:22161/user/jobmanagersrjava.util.UUIDm/J

leastSigBitsJ

mostSigBitsxpHv

So it does (I think) resolve the right leader of the HA setup, but I cannot tell what happens from there: sadly, the DEBUG logs do not expose which server the client hits.


Gary Yao-2

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Hi Vishal,

You should check the contents of znode /flink_test/[...]/rest_server_lock to see
if the URL is correct.

The host and port should be logged by the RestClient [1]. If you do not see the
message "Sending request of class [...]]" on DEBUG level, probably the client is
not able to get the address from ZK in the first place.
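One way to follow up on this from the client's DEBUG output: the ZooKeeper `ClientCnxn` lines posted in this thread carry a `replyHeader:: xid,zxid,err` triple followed by the requested path. Assuming that layout, the sketch below lists every lookup whose error code is non-zero. The function name is hypothetical. In ZooKeeper's `KeeperException` codes, -101 is `NONODE`, meaning the requested znode does not exist.

```shell
# Read client log lines on stdin and print "<error-code> <znode-path>"
# for every ZooKeeper reply whose error code is not 0.
zk_failed_lookups() {
  sed -n "s/.*replyHeader:: [0-9-]*,[0-9-]*,\(-*[0-9][0-9]*\) request:: '\([^,]*\),.*/\1 \2/p" |
    awk '$1 != 0'
}

# Usage (hypothetical log file name):
#   zk_failed_lookups < flink-client.log
```

Fed the client log posted in this thread, this should report the `/flink_test/da_15/leader/rest_server_lock` and `/flink_test/da_15/leader/dispatcher_lock` paths with code -101, which would be consistent with the client not finding the address in ZK.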

Lastly, the stacktrace you posted seems to be cut off. Can you attach the
complete client logs?

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.0-rc6/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L156

On Tue, Jun 26, 2018 at 4:09 PM, Vishal Santoshi <[hidden email]> wrote:

The leader znode is the right one (its content is binary):

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock

wFDakka.tcp://flink@flink-9edd15d7.bf2.tumblr.net:22161/user/jobmanagersrjava.util.UUIDm/J

   leastSigBitsJ

mostSigBitsxpHv

So it does (I think) resolve the right leader of the HA setup, but I cannot tell what happens from there: sadly, the DEBUG logs do not expose which server the client hits.


Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Ok, I will check.

On Tue, Jun 26, 2018, 12:39 PM Gary Yao <[hidden email]> wrote:

Hi Vishal,

You should check the contents of znode /flink_test/[...]/rest_server_lock to see
if the URL is correct.

The host and port should be logged by the RestClient [1]. If you do not see the
message "Sending request of class [...]]" on DEBUG level, probably the client is
not able to get the address from ZK in the first place.

Lastly, the stacktrace you posted seems to be cut off. Can you attach the
complete client logs?

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.0-rc6/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L156

On Tue, Jun 26, 2018 at 4:09 PM, Vishal Santoshi <[hidden email]> wrote:
The leader znode is the right one ( its content is binary ):

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock

wFDakka.tcp://flink@...:22161/user/jobmanagersrjava.util.UUIDm/J

   leastSigBitsJ

mostSigBitsxpHv

So it does ( I think ) resolve the right HA leader, but beyond that the logs do not help: sadly, the DEBUG logs do not expose which server the client hits.

On Tue, Jun 26, 2018 at 9:57 AM, Vishal Santoshi <[hidden email]> wrote:
OK, a few things:

2018-06-26 13:31:29 INFO CliFrontend:282 - Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)

...

2018-06-26 13:31:31 INFO ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session

2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181

2018-06-26 13:31:31 INFO ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000

2018-06-26 13:31:31 INFO RestClient:119 - Rest client endpoint started.

2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3 replyHeader:: 2,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}

2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3 replyHeader:: 3,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3 replyHeader:: 4,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3 replyHeader:: 5,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3 replyHeader:: 6,60416530560,-101 request:: '/flink_test/da_15/leader/rest_server_lock,T response::

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3 replyHeader:: 7,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3 replyHeader:: 8,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3 replyHeader:: 9,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3 replyHeader:: 10,60416530560,-101 request:: '/flink_test/da_15/leader/dispatcher_lock,T response::

2018-06-26 13:31:31 INFO CliFrontend:914 - Waiting for response...

Waiting for response...

2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:32:01 INFO RestClient:123 - Shutting down rest endpoint.

2018-06-26 13:32:01 INFO RestClient:140 - Rest endpoint shutdown complete.

2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing

2018-06-26 13:32:01 INFO CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting

2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing

2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing

2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,60416530561,0 request:: null response:: null

2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07

2018-06-26 13:32:01 INFO ZooKeeper:684 - Session: 0x35add547801ea07 closed

2018-06-26 13:32:01 INFO ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket

2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.

org.apache.flink.util.FlinkException: Failed to retrieve job list.

at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)

at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)

On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:
By the way, this was in an HA setup.

On Jun 26, 2018, at 5:39 PM, zhangminglei <[hidden email]> wrote:

Hi, Gary Yao

I once found that the IP address [ jobmanager.rpc.address ] had changed from 10.208.73.129 to localhost. I think that could cause the issue. What do you think?

Cheers
Minglei

On Jun 26, 2018, at 4:53 PM, Gary Yao <[hidden email]> wrote:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary
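One way to check the point above is to compare the high-availability settings of the client's flink-conf.yaml with those of the cluster. The following is only a sketch: the function names and the file paths in the usage comment are hypothetical, and it assumes the settings appear as plain, unindented `high-availability...` lines in both files.

```shell
#!/bin/sh
# Sketch: compare the high-availability.* settings of two flink-conf.yaml files.
# Function names are made up for this example.

ha_settings() {
  # Keep only lines starting with "high-availability", strip trailing
  # whitespace, and sort so that line order does not matter.
  grep -E '^high-availability' "$1" | sed 's/[[:space:]]*$//' | sort
}

same_ha_config() {
  # Succeeds (exit status 0) when both files carry identical HA settings.
  [ "$(ha_settings "$1")" = "$(ha_settings "$2")" ]
}

# Usage (both paths are placeholders):
# same_ha_config /path/to/client/conf/flink-conf.yaml /path/to/cluster/conf/flink-conf.yaml \
#   && echo "HA settings match" || echo "HA settings differ"
```

If the settings differ (for example in high-availability.zookeeper.path.root or high-availability.cluster-id), the client would look up the leader under a different znode than the one the cluster writes to.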

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is

web.port: 8081
rest.port: 8081

to the JM flink conf ?

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission and it does succeed through the UI.

Has there been a regression on CLI... I could not find any documentation around it.

I have a HA JM setup.

On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:

The watermark issue is known and will be fixed in 1.5.1

On 25.06.2018 15:03, Vishal Santoshi wrote:

Thank you....

One addition

I do not see WM info on the UI ( Attached )

Is this a known issue? The same pipeline on our production cluster shows the WM ( in fact, we never had an issue with watermarks not appearing ). Am I missing something?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:

Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how the processes communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint of the running job, start the job from the savepoint on the new cluster, and shut the old job down.

2. Savepoints should be compatible.

3. You can keep the slot configuration as before.

4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody made sure that it is possible).

Best, Fabian
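The savepoint-based migration from point 1 could be sketched roughly as follows. This is a dry run that only prints the commands for review rather than executing them. The installation directories, savepoint directory and jar path are hypothetical placeholders; the job ID is simply the one from the log earlier in this thread. The savepoint path for the second command has to be copied from the output of the first.

```shell
#!/bin/sh
# Dry-run sketch: print the CLI commands for moving one job from the old
# (1.4) cluster to a separately started new (1.5) cluster via a savepoint.
# OLD_FLINK and NEW_FLINK are assumed installation directories.
OLD_FLINK=${OLD_FLINK:-/opt/flink-1.4}
NEW_FLINK=${NEW_FLINK:-/opt/flink-1.5}

print_upgrade_plan() {
  job_id=$1
  savepoint_dir=$2
  job_jar=$3
  # 1. Take a savepoint of the running job with the old client.
  echo "$OLD_FLINK/bin/flink savepoint $job_id $savepoint_dir"
  # 2. Start the job from that savepoint on the new cluster (detached).
  echo "$NEW_FLINK/bin/flink run -d -s <savepoint-path-from-step-1> $job_jar"
  # 3. Once the new job is running, shut the old job down.
  echo "$OLD_FLINK/bin/flink cancel $job_id"
}

# Example call with placeholder directory and jar:
print_upgrade_plan 454cd84a519f3b50e88bcb378d8a1330 /path/to/savepoints /path/to/job.jar
```

Printing instead of executing keeps the sketch safe to try; the printed lines can be run one at a time once the paths have been checked.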

Christophe Jolif

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

In reply to this post by Chesnay Schepler

Chesnay,

Do you have a rough idea of the 1.5.1 timeline?

Thanks,

Christophe

Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

I am not seeing rest_server_lock. Is it transient ( ephemeral znode ) for the duration of the cli command ?

[zk: localhost:2181(CONNECTED) 2] ls /flink_test/da_15

[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter]

The logs say

2018-06-28 14:02:56 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-28 14:02:56 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

Is this a relative path, given these settings?

high-availability.zookeeper.path.root: /flink_test

high-availability.cluster-id: /da_15

I do not see /leader/rest_server_lock during the CLI run, nor before or after it.

I am a little stumped. I do not see the above log lines on 1.4, so I am not sure whether /leader/rest_server_lock comes from the new code.

Gary Yao-2

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Hi Vishal,

The znode /flink_test/da_15/leader/rest_server_lock should exist as long as your
Flink 1.5 cluster is running. In 1.4 this znode will not be created. Are you
sure that the znode does not exist? Unfortunately you only attached the output
of "ls /flink_test/da_15".

Can you share the complete JobManager log files from a cluster that is
(re-)starting?

Best,
Gary
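For reference, the full znode path named above is what you get by joining the two settings quoted in this thread (high-availability.zookeeper.path.root and high-availability.cluster-id) with the relative path from the client log. A small sketch of that join follows; the helper names are made up for illustration, and the printed line is meant to be pasted into a zkCli session.

```shell
#!/bin/sh
# Sketch: build the absolute znode path from the HA root, the cluster id,
# and the relative path printed by ZooKeeperLeaderRetrievalService.

strip_slashes() {
  # Remove one leading and one trailing slash, if present.
  s=${1#/}
  s=${s%/}
  printf '%s' "$s"
}

znode_path() {
  printf '/%s/%s/%s\n' "$(strip_slashes "$1")" "$(strip_slashes "$2")" "$(strip_slashes "$3")"
}

# Values taken from the configuration and log lines quoted in this thread.
echo "get $(znode_path /flink_test /da_15 /leader/rest_server_lock)"
# prints: get /flink_test/da_15/leader/rest_server_lock
```

Listing only /flink_test/da_15 does not show this znode, since it sits one level further down under leader.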

Vishal Santoshi

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Thanks! I did not see the znode and thus did not paste the ls output. Anyway, I will get you the full JM log ASAP.
