1. Has anyone done a rolling upgrade from 1.4 to 1.5, or can it be done at all? I am not sure we can. It seems that the JM cannot recover jobs, failing with this exception:

Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2

2. Does taking a SP on 1.4 and resuming on 1.5 work ( pretty basic, but no harm asking ) ?

3. https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html#update-configuration-for-reworked-job-deployment
Regarding taskmanager.numberOfTaskSlots: what would be the desired setting in a standalone ( non Mesos/YARN ) cluster ?

4. I suspended all jobs and installed 1.5 on the JM ( the TMs are still running 1.4 ). The JM refuses to start with

Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: 2018-06-23 11:34:23 ERROR JobManager:116 - Failed to recover job 454cd84a519f3b50e88bcb378d8a1330.
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: java.lang.InstantiationError: org.apache.flink.runtime.blob.BlobKey
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at sun.reflect.GeneratedSerializationConstructorAccessor51.newInstance(Unknown Source)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Jun 23 07:34:23 flink-ad21ac07.bf2.tumblr.net docker[3395]: at java.io.ObjectStreamClass.newInstance(ObjectStreamClass.java:1079)
Jun .....

Any feedback would be highly appreciated.
|
Hi Vishal,

1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how the processes communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint of the running job, start from the savepoint on the new cluster, and shut the old job down.
2. Savepoints should be compatible.
3. You can keep the slot configuration as before.
4. As I said before, mixing 1.5 and 1.4 processes does not work (or at least, it was not considered a design goal and nobody made sure that it is possible).

Best, Fabian

2018-06-23 13:38 GMT+02:00 Vishal Santoshi <[hidden email]>:
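The migration order described above can be sketched as a sequence of Flink CLI invocations. This is only a sketch under assumptions: the two `bin/flink` locations, the savepoint directory, and the jar name are hypothetical placeholders, and the savepoint path passed to `run -s` is the one the `savepoint` command prints when it completes. The helper merely builds the argument lists; it does not run anything.

```python
from typing import List


def migration_commands(old_flink: str, new_flink: str, job_id: str,
                       savepoint_dir: str, savepoint_path: str,
                       job_jar: str) -> List[List[str]]:
    """Build the CLI calls for a savepoint-based move from 1.4 to 1.5.

    old_flink / new_flink are the (hypothetical) paths of the bin/flink
    scripts of the 1.4 and 1.5 distributions, each configured for its
    own cluster.
    """
    return [
        # 1. Take a savepoint of the job still running on the 1.4 cluster.
        [old_flink, "savepoint", job_id, savepoint_dir],
        # 2. Start the job detached on the 1.5 cluster from that savepoint.
        [new_flink, "run", "-d", "-s", savepoint_path, job_jar],
        # 3. Shut the old job down.
        [old_flink, "cancel", job_id],
    ]


if __name__ == "__main__":
    for cmd in migration_commands(
        "/opt/flink-1.4/bin/flink",      # hypothetical install path
        "/opt/flink-1.5/bin/flink",      # hypothetical install path
        "454cd84a519f3b50e88bcb378d8a1330",
        "hdfs:///savepoints",            # hypothetical target directory
        "hdfs:///savepoints/savepoint-xyz",  # hypothetical; printed by step 1
        "my-job.jar",                    # hypothetical jar name
    ):
        print(" ".join(cmd))
```

Whether the old job is cancelled before or after the new one starts depends on whether both may briefly run in parallel against the same sources and sinks; the order here simply follows the reply above.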
|
Thank you. One addition: I do not see WM info on the UI ( attached ). Is this a known issue? The same pipe on our production has the WM ( in fact, we never had an issue with watermarks not appearing ). Am I missing something?

On Mon, Jun 25, 2018 at 4:15 AM, Fabian Hueske <[hidden email]> wrote:
Attachment: Screen Shot 2018-06-25 at 9.01.17 AM.png (194K)
|
The watermark issue is known and will be fixed in 1.5.1.
On 25.06.2018 15:03, Vishal Santoshi wrote:
|
Another issue I saw with the Flink CLI:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 120000 ms
at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apach

This was a simple submission and it does succeed through the UI. Has there been a regression in the CLI? I could not find any documentation around it. I have an HA JM setup.

On Mon, Jun 25, 2018 at 10:22 AM, Chesnay Schepler <[hidden email]> wrote:
|
I think all I need to add to the JM flink conf is the following?

web.port: 8081
rest.port: 8081

On Mon, Jun 25, 2018 at 10:46 AM, Vishal Santoshi <[hidden email]> wrote:
|
Hi Vishal,
Could it be that you are not using the 1.5.0 client? The stacktrace you posted does not reference valid lines of code in the release-1.5.0-rc6 tag.

If you have an HA setup, the host and port of the leading JM will be looked up from ZooKeeper before job submission. Therefore, the flink-conf.yaml used by the client must have the same ZooKeeper configuration as used by the Flink cluster.

Best, Gary

On Mon, Jun 25, 2018 at 5:32 PM, Vishal Santoshi <[hidden email]> wrote:
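One way to check the point about matching ZooKeeper settings is to compare the HA-related keys of the client's flink-conf.yaml with those of the cluster. The following is a rough sketch, not a Flink tool: the parser only handles the flat `key: value` lines that flink-conf.yaml uses, and the list of keys to compare is an assumption about which settings matter here.

```python
from typing import Dict, Optional, Tuple

# Keys assumed to be relevant for leader lookup via ZooKeeper.
HA_KEYS = (
    "high-availability",
    "high-availability.zookeeper.quorum",
    "high-availability.zookeeper.path.root",
    "high-availability.cluster-id",
)


def parse_flink_conf(text: str) -> Dict[str, str]:
    """Parse flat `key: value` lines; '#' starts a comment."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values such as host:2181 survive.
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def ha_mismatches(client: Dict[str, str],
                  cluster: Dict[str, str]
                  ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (client_value, cluster_value)} for every differing HA key."""
    diffs = {}
    for key in HA_KEYS:
        a, b = client.get(key), cluster.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


if __name__ == "__main__":
    # Hypothetical file locations for the client and the JM configuration.
    with open("client/flink-conf.yaml") as f:
        client_conf = parse_flink_conf(f.read())
    with open("jobmanager/flink-conf.yaml") as f:
        cluster_conf = parse_flink_conf(f.read())
    for key, (a, b) in ha_mismatches(client_conf, cluster_conf).items():
        print(f"{key}: client={a!r} cluster={b!r}")
```

An empty result means the compared keys agree; it does not prove that the client can actually reach the ZooKeeper quorum.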
|
Hi Gary Yao,

I once found that the IP address [ jobmanager.rpc.address ] had changed from 10.208.73.129 to localhost. I think that could cause this issue. What do you think?

Cheers, Minglei
|
By the way, that was in an HA setup.
|
OK, a few things:

2018-06-26 13:31:29 INFO CliFrontend:282 - Starting Command Line Client (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 @ 14:54:44 UTC)
2018-06-26 13:31:31 INFO ClientCnxn:876 - Socket connection established to zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, initiating session
2018-06-26 13:31:31 DEBUG ClientCnxn:949 - Session establishment request sent on zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181
2018-06-26 13:31:31 INFO ClientCnxn:1299 - Session establishment complete on server zk-f1fb95b9.bf2.tumblr.net/10.246.218.17:2181, sessionid = 0x35add547801ea07, negotiated timeout = 40000
2018-06-26 13:31:31 INFO RestClient:119 - Rest client endpoint started.
2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 2,3 replyHeader:: 2,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}
2018-06-26 13:31:31 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 3,3 replyHeader:: 3,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 4,3 replyHeader:: 4,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 5,3 replyHeader:: 5,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/rest_server_lock serverPath:/flink_test/da_15/leader/rest_server_lock finished:false header:: 6,3 replyHeader:: 6,60416530560,-101 request:: '/flink_test/da_15/leader/rest_server_lock,T response::
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 7,3 replyHeader:: 7,60416530560,0 request:: '/flink_test,F response:: s{47265479496,47265479496,1489163688703,1489163688703,0,2,0,0,0,2,60416492885}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 8,3 replyHeader:: 8,60416530560,0 request:: '/flink_test/da_15,F response:: s{60416492885,60416492885,1529755199131,1529755199131,0,5,0,0,0,5,60416521584}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 9,3 replyHeader:: 9,60416530560,0 request:: '/flink_test/da_15/leader,F response:: s{60416492887,60416492887,1529755199191,1529755199191,0,1,0,0,0,1,60416492888}
2018-06-26 13:31:31 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:/flink_test/da_15/leader/dispatcher_lock serverPath:/flink_test/da_15/leader/dispatcher_lock finished:false header:: 10,3 replyHeader:: 10,60416530560,-101 request:: '/flink_test/da_15/leader/dispatcher_lock,T response::
2018-06-26 13:31:31 INFO CliFrontend:914 - Waiting for response...
Waiting for response...
2018-06-26 13:31:44 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms
2018-06-26 13:31:58 DEBUG ClientCnxn:742 - Got ping response for sessionid: 0x35add547801ea07 after 0ms
2018-06-26 13:32:01 INFO RestClient:123 - Shutting down rest endpoint.
2018-06-26 13:32:01 INFO RestClient:140 - Rest endpoint shutdown complete.
2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2018-06-26 13:32:01 INFO ZooKeeperLeaderRetrievalService:117 - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-06-26 13:32:01 DEBUG CuratorFrameworkImpl:282 - Closing
2018-06-26 13:32:01 INFO CuratorFrameworkImpl:821 - backgroundOperationsLoop exiting
2018-06-26 13:32:01 DEBUG CuratorZookeeperClient:199 - Closing
2018-06-26 13:32:01 DEBUG ConnectionState:115 - Closing
2018-06-26 13:32:01 DEBUG ZooKeeper:673 - Closing session: 0x35add547801ea07
2018-06-26 13:32:01 DEBUG ClientCnxn:1370 - Closing client for session: 0x35add547801ea07
2018-06-26 13:32:01 DEBUG ClientCnxn:843 - Reading reply sessionid:0x35add547801ea07, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,60416530561,0 request:: null response:: null
2018-06-26 13:32:01 DEBUG ClientCnxn:1354 - Disconnecting client for session: 0x35add547801ea07
2018-06-26 13:32:01 INFO ZooKeeper:684 - Session: 0x35add547801ea07 closed
2018-06-26 13:32:01 INFO ClientCnxn:519 - EventThread shut down for session: 0x35add547801ea07
2018-06-26 13:32:01 DEBUG ClientCnxn:1146 - An exception was thrown while closing send thread for session 0x35add547801ea07 : Unable to read additional data from server sessionid 0x35add547801ea07, likely server has closed socket
2018-06-26 13:32:01 ERROR CliFrontend:891 - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:429)
at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:412)

On Tue, Jun 26, 2018 at 5:43 AM, zhangminglei <[hidden email]> wrote:
|
The leader znode is the right one ( its content is binary ):

get /flink_test/da_15/leader/00000000000000000000000000000000/job_manager_lock
wFDakka.tcp://flink@...:22161/user/jobmanagersrjava.util.UUIDm/J leastSigBitsJ mostSigBitsxpHv

So it does ( I think ) resolve the right leader of the HA setup, but I cannot tell what happens from there ( sadly, the logs do not help, as the DEBUG logs do not expose which server it hits ).

On Tue, Jun 26, 2018 at 9:57 AM, Vishal Santoshi <[hidden email]> wrote:
|
Hi Vishal,
You should check the contents of znode /flink_test/[...]/rest_server_lock to see if the URL is correct. The host and port should be logged by the RestClient [1]. If you do not see the message "Sending request of class [...]]" on DEBUG level, probably the client is not able to get the address from ZK in the first place. Lastly, the stacktrace you posted seems to be cut off. Can you attach the complete client logs? Best, Gary [1] https://github.com/apache/flink/blob/release-1.5.0-rc6/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L156 On Tue, Jun 26, 2018 at 4:09 PM, Vishal Santoshi <[hidden email]> wrote:
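Because the znode content is binary, reading the address out of it by eye is awkward. Judging from the dump earlier in this thread (`w`, a length byte, then the address, then a serialized `java.util.UUID`), the payload looks like a Java serialization stream whose first record is a block containing a `writeUTF` string. The sketch below decodes just that leading string; it assumes this layout also holds for rest_server_lock, which the thread does not confirm, and the sample address is made up.

```python
import struct

STREAM_HEADER = b"\xac\xed\x00\x05"  # Java serialization magic + version
TC_BLOCKDATA = 0x77                  # short block: one length byte follows


def leader_address_from_znode(data: bytes) -> str:
    """Extract the leading writeUTF string from a leader znode payload."""
    if data[:4] != STREAM_HEADER:
        raise ValueError("not a Java serialization stream")
    if len(data) < 6 or data[4] != TC_BLOCKDATA:
        raise ValueError("expected a short block-data record")
    block_len = data[5]
    block = data[6:6 + block_len]
    if len(block) < 2:
        raise ValueError("block too short to hold a string length")
    # writeUTF stores a two-byte big-endian length, then the string bytes.
    (utf_len,) = struct.unpack(">H", block[:2])
    if len(block) < 2 + utf_len:
        raise ValueError("truncated string")
    # Java's modified UTF-8 matches UTF-8 for plain ASCII addresses.
    return block[2:2 + utf_len].decode("utf-8")


if __name__ == "__main__":
    # Hypothetical payload built by hand for illustration only.
    address = "akka.tcp://flink@example-host:22161/user/jobmanager"
    body = struct.pack(">H", len(address)) + address.encode("utf-8")
    sample = STREAM_HEADER + bytes([TC_BLOCKDATA, len(body)]) + body
    print(leader_address_from_znode(sample))
```

The bytes could be fetched with any ZooKeeper client; the function deliberately takes raw bytes so that it does not depend on a particular client library.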
|
Ok, I will check. On Tue, Jun 26, 2018, 12:39 PM Gary Yao <[hidden email]> wrote:
|
Chesnay,

Do you have a rough idea of the 1.5.1 timeline?

Thanks,
--
Christophe

On Mon, Jun 25, 2018 at 4:22 PM, Chesnay Schepler <[hidden email]> wrote:
|
I am not seeing rest_server_

[zk: localhost:2181(CONNECTED) 2] ls /flink_test/da_15
[jobgraphs, leader, checkpoints, leaderlatch, checkpoint-counter]

The logs say

2018-06-28 14:02:56 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2018-06-28 14:02:56 INFO ZooKeeperLeaderRetrievalService:100 - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.

Is this a relative path?

high-availability.zookeeper.path.root: /flink_test
high-availability.cluster-id: /da_15

I do not see /leader/rest_server_lock during the CLI run, nor before or after it. I am a little stumped. I do not see the above logs on 1.4, so I am not sure whether /leader/rest_server_lock comes from new code.

On Thu, Jun 28, 2018 at 3:30 AM, Christophe Jolif <[hidden email]> wrote:
|
Hi Vishal,
The znode /flink_test/da_15/leader/rest_server_lock should exist as long as your Flink 1.5 cluster is running. In 1.4 this znode will not be created. Are you sure that the znode does not exist? Unfortunately you only attached the output of "ls /flink_test/da_15". Can you share the complete JobManager log files from a cluster that is (re-)starting? Best, Gary On Thu, Jun 28, 2018 at 4:10 PM, Vishal Santoshi <[hidden email]> wrote:
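To make the path question concrete: the full znode named above is what you get by joining `high-availability.zookeeper.path.root`, `high-availability.cluster-id`, and the relative path from the log line. The helper below is a hypothetical illustration of that concatenation, not Flink code.

```python
def ha_znode_path(path_root: str, cluster_id: str, relative: str) -> str:
    """Join the HA root, cluster id, and a relative znode path."""
    parts = [p.strip("/") for p in (path_root, cluster_id, relative)]
    # Drop empty segments so stray or missing slashes do not matter.
    return "/" + "/".join(p for p in parts if p)


if __name__ == "__main__":
    # Values taken from the configuration and log lines quoted in this thread.
    print(ha_znode_path("/flink_test", "/da_15", "/leader/rest_server_lock"))
```

With the settings quoted in this thread, the relative path `/leader/rest_server_lock` from the log therefore corresponds to `/flink_test/da_15/leader/rest_server_lock`, so the place to look is under `/flink_test/da_15/leader`, not at the ZooKeeper root.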
|
Thanks! I did not see the znode and thus did not paste the ls... Anyway, I will get you the full JM log ASAP.

On Thu, Jun 28, 2018, 5:35 PM Gary Yao <[hidden email]> wrote:
|