Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...

Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Fabian Hueske-2
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...


Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...




Screen Shot 2018-06-25 at 9.01.17 AM.png (194K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Chesnay Schepler
The watermark issue is know and will be fixed in 1.5.1

On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...




Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms

at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)

at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)

at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)

at org.apach


This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 


I have a HA JM setup.





On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...





Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
I think all I need to add is 

web.port: 8081

rest.port: 8081


to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms

at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)

at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)

at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)

at org.apach


This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 


I have a HA JM setup.





On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...






Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Gary Yao-2
Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081

rest.port: 8081


to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms

at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)

at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)

at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)

at org.apach


This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 


I have a HA JM setup.





On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...







Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

zhangminglei
Hi, Gary Yao

Once I discovered that there was a change in the ip address[ jobmanager.rpc.address ]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think ?

Cheers
Minglei

在 2018年6月26日,下午4:53,Gary Yao <[hidden email]> 写道:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081
rest.port: 8081

to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 

I have a HA JM setup.




On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2



2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun 
.....



Any feedback would be highly appreciated...








Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

zhangminglei
By the way, in HA set up.

在 2018年6月26日,下午5:39,zhangminglei <[hidden email]> 写道:

Hi, Gary Yao

Once I discovered that there was a change in the ip address[ jobmanager.rpc.address ]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think ?

Cheers
Minglei

在 2018年6月26日,下午4:53,Gary Yao <[hidden email]> 写道:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081
rest.port: 8081

to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 

I have a HA JM setup.




On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2



2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun 
.....



Any feedback would be highly appreciated...









Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
OK few things 

2018-06-26 13:31:29 INFO  CliFrontend:282 -  Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)


...

2018-06-26 13:31:31 INFO  ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session

2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181

2018-06-26 13:31:31 INFO  ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000

2018-06-26 13:31:31 INFO  RestClient:119 - Rest client endpoint started.

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 2,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 3,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 4,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3  replyHeader:: 6,60416530560,-101  request:: '/flink_test/da_15/leader/rest_server_lock,T  response::  

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3  replyHeader:: 7,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3  replyHeader:: 8,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3  replyHeader:: 9,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3  replyHeader:: 10,60416530560,-101  request:: '/flink_test/da_15/leader/dispatcher_lock,T  response::  

2018-06-26 13:31:31 INFO  CliFrontend:914 - Waiting for response...

Waiting for response...

2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:32:01 INFO  RestClient:123 - Shutting down rest endpoint.

2018-06-26 13:32:01 INFO  RestClient:140 - Rest endpoint shutdown complete.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing

2018-06-26 13:32:01 INFO  CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting

2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing

2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing

2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11  replyHeader:: 11,60416530561,0  request:: null response:: null

2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07

2018-06-26 13:32:01 INFO  ZooKeeper:684 - Session: 0x35add547801ea07 closed

2018-06-26 13:32:01 INFO  ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket

2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.

org.apache.flink.util.FlinkException: Failed to retrieve job list.

at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)

at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)



On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:
By the way, in HA set up.

在 2018年6月26日,下午5:39,zhangminglei <[hidden email]> 写道:

Hi, Gary Yao

Once I discovered that there was a change in the ip address[ jobmanager.rpc.address ]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think ?

Cheers
Minglei

在 2018年6月26日,下午4:53,Gary Yao <[hidden email]> 写道:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081
rest.port: 8081

to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 

I have a HA JM setup.




On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2



2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun 
.....



Any feedback would be highly appreciated...










Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
The leader znode is the right one  ( it is a binary ) 

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock

wFDakka.tcp://flink@...:22161/user/jobmanagersrjava.util.UUIDm/J

                                                                                                 leastSigBitsJ

                                                                                                              mostSigBitsxpHv


So it does ( I think ) resolve the right leader of the HA, but from there ( the logs do not help as DEBUG logs do not expose what server it hits sadly ) .



On Tue, Jun 26, 2018 at 9:57 AM, Vishal Santoshi <[hidden email]> wrote:
OK few things 

2018-06-26 13:31:29 INFO  CliFrontend:282 -  Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)


...

2018-06-26 13:31:31 INFO  ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session

2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181

2018-06-26 13:31:31 INFO  ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000

2018-06-26 13:31:31 INFO  RestClient:119 - Rest client endpoint started.

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 2,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 3,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 4,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3  replyHeader:: 6,60416530560,-101  request:: '/flink_test/da_15/leader/rest_server_lock,T  response::  

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3  replyHeader:: 7,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3  replyHeader:: 8,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3  replyHeader:: 9,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3  replyHeader:: 10,60416530560,-101  request:: '/flink_test/da_15/leader/dispatcher_lock,T  response::  

2018-06-26 13:31:31 INFO  CliFrontend:914 - Waiting for response...

Waiting for response...

2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:32:01 INFO  RestClient:123 - Shutting down rest endpoint.

2018-06-26 13:32:01 INFO  RestClient:140 - Rest endpoint shutdown complete.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing

2018-06-26 13:32:01 INFO  CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting

2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing

2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing

2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11  replyHeader:: 11,60416530561,0  request:: null response:: null

2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07

2018-06-26 13:32:01 INFO  ZooKeeper:684 - Session: 0x35add547801ea07 closed

2018-06-26 13:32:01 INFO  ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket

2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.

org.apache.flink.util.FlinkException: Failed to retrieve job list.

at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)

at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)



On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:
By the way, in HA set up.

在 2018年6月26日,下午5:39,zhangminglei <[hidden email]> 写道:

Hi, Gary Yao

Once I discovered that there was a change in the ip address[ jobmanager.rpc.address ]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think ?

Cheers
Minglei

在 2018年6月26日,下午4:53,Gary Yao <[hidden email]> 写道:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081
rest.port: 8081

to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 

I have a HA JM setup.




On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2



2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun 
.....



Any feedback would be highly appreciated...











Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Gary Yao-2
Hi Vishal,

You should check the contents of znode /flink_test/[...]/rest_server_lock to see
if the URL is correct.

The host and port should be logged by the RestClient [1]. If you do not see the
message "Sending request of class [...]]" on DEBUG level, probably the client is
not able to get the address from ZK in the first place.

Lastly, the stacktrace you posted seems to be cut off. Can you attach the
complete client logs?

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.0-rc6/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L156

On Tue, Jun 26, 2018 at 4:09 PM, Vishal Santoshi <[hidden email]> wrote:
The leader znode is the right one  ( it is a binary ) 

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock

wFDakka.tcp://flink@flink-9edd15d7.bf2.tumblr.net:22161/user/jobmanagersrjava.util.UUIDm/J

                                                                                                 leastSigBitsJ

                                                                                                              mostSigBitsxpHv


So it does ( I think ) resolve the right leader of the HA, but from there ( the logs do not help as DEBUG logs do not expose what server it hits sadly ) .



On Tue, Jun 26, 2018 at 9:57 AM, Vishal Santoshi <[hidden email]> wrote:
OK few things 

2018-06-26 13:31:29 INFO  CliFrontend:282 -  Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)


...

2018-06-26 13:31:31 INFO  ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session

2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181

2018-06-26 13:31:31 INFO  ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000

2018-06-26 13:31:31 INFO  RestClient:119 - Rest client endpoint started.

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 2,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 3,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 4,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3  replyHeader:: 6,60416530560,-101  request:: '/flink_test/da_15/leader/rest_server_lock,T  response::  

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3  replyHeader:: 7,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3  replyHeader:: 8,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3  replyHeader:: 9,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3  replyHeader:: 10,60416530560,-101  request:: '/flink_test/da_15/leader/dispatcher_lock,T  response::  

2018-06-26 13:31:31 INFO  CliFrontend:914 - Waiting for response...

Waiting for response...

2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:32:01 INFO  RestClient:123 - Shutting down rest endpoint.

2018-06-26 13:32:01 INFO  RestClient:140 - Rest endpoint shutdown complete.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing

2018-06-26 13:32:01 INFO  CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting

2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing

2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing

2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11  replyHeader:: 11,60416530561,0  request:: null response:: null

2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07

2018-06-26 13:32:01 INFO  ZooKeeper:684 - Session: 0x35add547801ea07 closed

2018-06-26 13:32:01 INFO  ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket

2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.

org.apache.flink.util.FlinkException: Failed to retrieve job list.

at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)

at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)



On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:
By the way, in HA set up.

在 2018年6月26日,下午5:39,zhangminglei <[hidden email]> 写道:

Hi, Gary Yao

Once I discovered that there was a change in the ip address[ jobmanager.rpc.address ]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think ?

Cheers
Minglei

在 2018年6月26日,下午4:53,Gary Yao <[hidden email]> 写道:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081
rest.port: 8081

to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 

I have a HA JM setup.




On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2



2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun 
.....



Any feedback would be highly appreciated...












Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
Ok, I will check. 

On Tue, Jun 26, 2018, 12:39 PM Gary Yao <[hidden email]> wrote:
Hi Vishal,

You should check the contents of znode /flink_test/[...]/rest_server_lock to see
if the URL is correct.

The host and port should be logged by the RestClient [1]. If you do not see the
message "Sending request of class [...]]" on DEBUG level, probably the client is
not able to get the address from ZK in the first place.

Lastly, the stacktrace you posted seems to be cut off. Can you attach the
complete client logs?

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.0-rc6/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L156

On Tue, Jun 26, 2018 at 4:09 PM, Vishal Santoshi <[hidden email]> wrote:
The leader znode is the right one  ( it is a binary ) 

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock

wFDakka.tcp://flink@...:22161/user/jobmanagersrjava.util.UUIDm/J

                                                                                                 leastSigBitsJ

                                                                                                              mostSigBitsxpHv


So it does ( I think ) resolve the right leader of the HA, but from there ( the logs do not help as DEBUG logs do not expose what server it hits sadly ) .



On Tue, Jun 26, 2018 at 9:57 AM, Vishal Santoshi <[hidden email]> wrote:
OK few things 

2018-06-26 13:31:29 INFO  CliFrontend:282 -  Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)


...

2018-06-26 13:31:31 INFO  ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session

2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181

2018-06-26 13:31:31 INFO  ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000

2018-06-26 13:31:31 INFO  RestClient:119 - Rest client endpoint started.

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3  replyHeader:: 2,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3  replyHeader:: 3,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3  replyHeader:: 4,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3  replyHeader:: 6,60416530560,-101  request:: '/flink_test/da_15/leader/rest_server_lock,T  response::  

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3  replyHeader:: 7,60416530560,0  request:: '/flink_test,F  response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3  replyHeader:: 8,60416530560,0  request:: '/flink_test/da_15,F  response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3  replyHeader:: 9,60416530560,0  request:: '/flink_test/da_15/leader,F  response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888} 

2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3  replyHeader:: 10,60416530560,-101  request:: '/flink_test/da_15/leader/dispatcher_lock,T  response::  

2018-06-26 13:31:31 INFO  CliFrontend:914 - Waiting for response...

Waiting for response...

2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms

2018-06-26 13:32:01 INFO  RestClient:123 - Shutting down rest endpoint.

2018-06-26 13:32:01 INFO  RestClient:140 - Rest endpoint shutdown complete.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-26 13:32:01 INFO  ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing

2018-06-26 13:32:01 INFO  CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting

2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing

2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing

2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11  replyHeader:: 11,60416530561,0  request:: null response:: null

2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07

2018-06-26 13:32:01 INFO  ZooKeeper:684 - Session: 0x35add547801ea07 closed

2018-06-26 13:32:01 INFO  ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07

2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket

2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.

org.apache.flink.util.FlinkException: Failed to retrieve job list.

at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)

at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)



On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:
By the way, in HA set up.

在 2018年6月26日,下午5:39,zhangminglei <[hidden email]> 写道:

Hi, Gary Yao

Once I discovered that there was a change in the ip address[ jobmanager.rpc.address ]. From 10.208.73.129 to localhost. I think that will cause the issue. What do you think ?

Cheers
Minglei

在 2018年6月26日,下午4:53,Gary Yao <[hidden email]> 写道:

Hi Vishal,

Could it be that you are not using the 1.5.0 client? The stacktrace you posted
does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have a HA setup, the host and port of the leading JM will be looked up
from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the
client must have the same ZooKeeper configuration as used by the Flink cluster.

Best,
Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
I think all I need to add is 

web.port: 8081
rest.port: 8081

to the JM flink conf ? 

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
Another issue I saw with flink cli...

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission  and it does succeed through the UI. 

Has there been a regression on CLI... I could not find any documentation around it. 

I have a HA JM setup.




On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2



2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun 
.....



Any feedback would be highly appreciated...












Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Christophe Jolif
In reply to this post by Chesnay Schepler
Chesnay,

Do you have rough idea of the 1.5.1 timeline?

Thanks,
--
Christophe

On Mon, Jun 25, 2018 at 4:22 PM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...







--
Christophe
Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi

I am not seeing rest_server_lock. Is it transient ( ephemeral znode )  for the duration of the cli command ?


[zk: localhost:2181(CONNECTED) 2] ls /flink_test/da_15

[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter]



The logs say 

2018-06-28 14:02:56 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-28 14:02:56 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.


Is this a relative path, 

high-availability.zookeeper.path.root: /flink_test

high-availability.cluster-id: /da_15



I do not see  /leader/rest_server_lock both during the duration of the cli run ( or before or after ). 

I am a little stumped.... I do not see the above logs on 1.4 so am not sure whether /leader/rest_server_lock is the new code...


On Thu, Jun 28, 2018 at 3:30 AM, Christophe Jolif <[hidden email]> wrote:
Chesnay,

Do you have rough idea of the 1.5.1 timeline?

Thanks,
--
Christophe

On Mon, Jun 25, 2018 at 4:22 PM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...







--
Christophe

Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Gary Yao-2
Hi Vishal,

The znode /flink_test/da_15/leader/rest_server_lock should exist as long as your
Flink 1.5 cluster is running. In 1.4 this znode will not be created. Are you
sure that the znode does not exist? Unfortunately you only attached the output
of "ls /flink_test/da_15".

Can you share the complete JobManager log files from a cluster that is
(re-)starting?

Best,
Gary

On Thu, Jun 28, 2018 at 4:10 PM, Vishal Santoshi <[hidden email]> wrote:

I am not seeing rest_server_lock. Is it transient ( ephemeral znode )  for the duration of the cli command ?


[zk: localhost:2181(CONNECTED) 2] ls /flink_test/da_15

[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter]



The logs say 

2018-06-28 14:02:56 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-28 14:02:56 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.


Is this a relative path, 

high-availability.zookeeper.path.root: /flink_test

high-availability.cluster-id: /da_15



I do not see  /leader/rest_server_lock both during the duration of the cli run ( or before or after ). 

I am a little stumped.... I do not see the above logs on 1.4 so am not sure whether /leader/rest_server_lock is the new code...


On Thu, Jun 28, 2018 at 3:30 AM, Christophe Jolif <[hidden email]> wrote:
Chesnay,

Do you have rough idea of the 1.5.1 timeline?

Thanks,
--
Christophe

On Mon, Jun 25, 2018 at 4:22 PM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...







--
Christophe


Reply | Threaded
Open this post in threaded view
|

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

Vishal Santoshi
Thanks! I did not see the znode and thus did not paste the ls...anywaz will get you the full JM log ASAP....

On Thu, Jun 28, 2018, 5:35 PM Gary Yao <[hidden email]> wrote:
Hi Vishal,

The znode /flink_test/da_15/leader/rest_server_lock should exist as long as your
Flink 1.5 cluster is running. In 1.4 this znode will not be created. Are you
sure that the znode does not exist? Unfortunately you only attached the output
of "ls /flink_test/da_15".

Can you share the complete JobManager log files from a cluster that is
(re-)starting?

Best,
Gary

On Thu, Jun 28, 2018 at 4:10 PM, Vishal Santoshi <[hidden email]> wrote:

I am not seeing rest_server_lock. Is it transient ( ephemeral znode )  for the duration of the cli command ?


[zk: localhost:2181(CONNECTED) 2] ls /flink_test/da_15

[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter]



The logs say 

2018-06-28 14:02:56 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.

2018-06-28 14:02:56 INFO  ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.


Is this a relative path, 

high-availability.zookeeper.path.root: /flink_test

high-availability.cluster-id: /da_15



I do not see  /leader/rest_server_lock both during the duration of the cli run ( or before or after ). 

I am a little stumped.... I do not see the above logs on 1.4 so am not sure whether /leader/rest_server_lock is the new code...


On Thu, Jun 28, 2018 at 3:30 AM, Christophe Jolif <[hidden email]> wrote:
Chesnay,

Do you have rough idea of the 1.5.1 timeline?

Thanks,
--
Christophe

On Mon, Jun 25, 2018 at 4:22 PM, Chesnay Schepler <[hidden email]> wrote:
The watermark issue is know and will be fixed in 1.5.1


On 25.06.2018 15:03, Vishal Santoshi wrote:
Thank you....  

One addition

I do not see WM info on the UI  ( Attached ) 

Is this a know issue. The same pipe on our production has the WM ( In fact never had an issue with  Watermarks not appearing ) . Am I missing something ?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how they communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint on the running job, start from the savepoint on the new cluster and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody paid attention that it is possible).

Best, Fabian


2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:

1.  
Can or has any one  done  a rolling upgrade from 1.4 to 1.5 ?  I am not sure we can. It seems that JM cannot recover jobs with this exception

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2




2. 
Does SP on 1.4, resume on 1.5 ( pretty basic but no harm asking ) ?



3. 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment The taskmanager.numberOfTaskSlots: What would be the desired setting in a stand alone ( non mesos/yarn ) cluster ?


4. I suspend all jobs and establish 1.5 on the JM ( the TMs are still running with 1.4 ) . JM refuse to start  with 

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)

Jun 

.....



Any feedback would be highly appreciated...







--
Christophe