Hello all,

I'm building a standalone cluster with HA JobManager. So far, everything seems to work, but when I try to `flink run` my job, it fails with the following error:

Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.

Any ideas where I should look?

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
Hey guys and gals,

So, after a bit more digging, I found out that once HA is enabled, `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`, but I was expecting this). Because I set `high-availability.jobmanager.port` to `50010-50015`, my RPC port also changed (the docs made me think this would only affect the HA communication, not ALL communications). This can be checked on the Dashboard, under the JobManager configuration option.

But that didn't solve my problem. So far, the `flink run` still fails with the same message (I'm adding the full stacktrace of the failure in the end, just in case), but now I'm also seeing this message in the JobManager logs:

2018-05-02 16:44:32,373 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

Failure when using `flink run`:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
    at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
    at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
    at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
    at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
    at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
    at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
    ... 14 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
    ... 15 more

On Wed, May 2, 2018 at 9:52 AM, Julio Biason <[hidden email]> wrote:
Hi Julio,

Are you using the -m flag of "bin/flink run" by any chance? In HA mode, you cannot manually specify the JobManager address. The client determines the leader through ZooKeeper. If you did not configure the ZooKeeper quorum in the flink-conf.yaml on the machine from which you are submitting, this might explain the error message.

> But that didn't solve my problem. So far, the `flink run` still fails with the same message (I'm adding the full stacktrace of the failure in the end, just in case), but now I'm also seeing this message in the JobManager logs:

Unfortunately, the error message in your previous email is different. If the above does not solve your problem, can you attach the logs of the client and JobManager? Lastly, what Flink version are you running?

Best,
Gary

On Wed, May 2, 2018 at 6:51 PM, Julio Biason <[hidden email]> wrote:
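
For anyone hitting the same error, here is a minimal sketch of how one might check whether the client machine's flink-conf.yaml carries the ZooKeeper quorum mentioned above. `check_ha_conf` is a hypothetical helper and not part of Flink; it assumes the documented key names `high-availability` and `high-availability.zookeeper.quorum`, and it only looks at uncommented lines in the file.

```shell
# check_ha_conf is a hypothetical helper, not part of Flink.
# Usage: check_ha_conf <conf-dir>
# Prints the uncommented high-availability settings found in
# <conf-dir>/flink-conf.yaml and returns:
#   0 if the ZooKeeper quorum key is set,
#   1 if it is missing,
#   2 if the config file does not exist.
check_ha_conf() {
    conf_file="$1/flink-conf.yaml"
    if [ ! -f "$conf_file" ]; then
        echo "no flink-conf.yaml in $1" >&2
        return 2
    fi
    # Show every uncommented HA-related line, if any.
    grep -E '^[[:space:]]*high-availability' "$conf_file" || true
    # The client needs the quorum to look up the current leader.
    grep -Eq '^[[:space:]]*high-availability\.zookeeper\.quorum[[:space:]]*:' "$conf_file"
}
```

Running something like `check_ha_conf "${FLINK_CONF_DIR:-conf}"` on the submitting machine before `bin/flink run` would show quickly whether the client has any chance of finding the leader through ZooKeeper.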
Hey Gary,

Yes, I was still running with the `-m` flag on my dev machine -- partially configured like prod, but without the HA stuff. I never thought it could be a problem, since even the web interface can redirect from the secondary back to the primary.

On Thu, May 3, 2018 at 9:36 AM, Gary Yao <[hidden email]> wrote:
Hey Gary (again),

Yup, that worked. Now I can launch apps again.

On Thu, May 3, 2018 at 11:00 AM, Julio Biason <[hidden email]> wrote:
Hi Julio,

I agree that the job submission should work in HA mode if you manually specify the JobManager. At a minimum, a proper error message should be shown. Feel free to open an issue in JIRA.

You already stated that you can maintain multiple configuration directories as a workaround. It is possible to switch between them by setting the FLINK_CONF_DIR environment variable, e.g.,

    FLINK_CONF_DIR=/path/to/conf-dir-1 bin/flink run ...
    FLINK_CONF_DIR=/path/to/conf-dir-2 bin/flink run ...

Starting with 1.5, this should be a non-issue because the job submission happens through HTTP and every non-leading master redirects requests to the leading master.

Best,
Gary

On Thu, May 3, 2018 at 10:23 PM, Julio Biason <[hidden email]> wrote:
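
The FLINK_CONF_DIR workaround above can be wrapped in a small shell function so the variable is set for one invocation only. This is just a sketch: `flink_with_conf` and the `FLINK_BIN` variable are hypothetical names introduced here for illustration, not something Flink ships, and the default of `bin/flink` assumes the function is called from the Flink installation directory.

```shell
# flink_with_conf is a hypothetical wrapper, not part of Flink.
# FLINK_BIN is an assumed variable naming the client script;
# it defaults to bin/flink relative to the current directory.
# Usage: flink_with_conf <conf-dir> <flink arguments...>
flink_with_conf() {
    conf_dir="$1"
    shift
    # Refuse to run against a directory that has no client config.
    if [ ! -f "$conf_dir/flink-conf.yaml" ]; then
        echo "no flink-conf.yaml in $conf_dir" >&2
        return 2
    fi
    # Set FLINK_CONF_DIR for this one invocation and pass the
    # remaining arguments through unchanged.
    FLINK_CONF_DIR="$conf_dir" "${FLINK_BIN:-bin/flink}" "$@"
}
```

With this in place, `flink_with_conf /path/to/conf-dir-1 run ...` and `flink_with_conf /path/to/conf-dir-2 run ...` correspond to the two commands in the message above.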