Hi, I came across an issue during job submission via Flink Cli Client with Flink 1.7.1 in high availability mode. Setup: Flink version:: 1.7.1 Cluster:: K8s Mode:: High availability with 2 jobmanagers CLI Command ./bin/flink run -d -c MyExample /myexample.jar The CLI runs inside a K8s job and submits the Flink job to the Flink cluster. The K8s job spec allows it to try 3 times to submit the job. Result: 2019-09-11 22:32:12.908 [Flink-RestClusterClient-IO-thread-4] level=DEBUG org.apache.flink.runtime.rest.RestClient - Sending request of class class org.apache.flink.runtime.rest.messages.job.JobSubmitRequestBody
to job-jm-1.job-jm-svc.job-namespace.svc.cluster.local:8081/v1/jobs 2019-09-11 22:32:14.186 [flink-rest-client-netty-thread-1] level=ERROR org.apache.flink.runtime.rest.RestClient - Response was not valid JSON. org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input at [Source: org.apache.flink.shaded.netty4.io.netty.buffer.ByteBufInputStream@2b88f8bb; line: 1, column: 0] at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:256) at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3851) at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3792) at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2272) at org.apache.flink.runtime.rest.RestClient$ClientHandler.readRawResponse(RestClient.java:504) at org.apache.flink.runtime.rest.RestClient$ClientHandler.channelRead0(RestClient.java:452) ……… 2019-09-11 22:32:14.186 [flink-rest-client-netty-thread-1] level=ERROR org.apache.flink.runtime.rest.RestClient - Unexpected plain-text response: …….. The job submission fails after exhausting the number of retries. Observations: I looked into the debug logs & Flink code to come to below conclusions –
Open questions:
Can someone help check and confirm my observations above and help answer the questions? Highly appreciate your time and help. ~ Abhinav Bajaj CLI Logs
-
2019-09-11 22:30:31.077 [main-EventThread] level=DEBUG o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Leader node has changed. …… job-jm-0 Logs - job-jm-1 Logs -
|
Hi Abhinav, I think the problem is the following: Flink has been designed so that the cluster's rest endpoint does not need to run in the same process as the JobManager. However, currently the rest endpoint is started in the same process as the JobManagers. Because of the design one needs to announce the address of the rest endpoint and for that we use leader election (it is not strictly required that there is a leading rest endpoint but we use it mainly for service discovery). That's why you see that there is one leader with an http address and another with an akka address. The former is the leading rest endpoint and the latter is the leading JobManager. So somehow, the rest endpoint of pod-0 and the JobManager of pod-1 became leaders. Now what happens with Flink <= 1.7 is that the rest endpoint sends you a redirect response if the co-located JobManager (process-wise) is not the leader. The problem is that Flink's RestClusterClient does not properly handle the redirect responses. Starting from Flink >= 1.8, we removed the redirection logic and instead let Flink internally handle this by proxying all request to the actual leading JobManager. Hence, my recommendation would be to upgrade to a newer Flink version and see whether the problem still remains. Cheers, Till On Fri, Sep 13, 2019 at 4:30 AM Bajaj, Abhinav <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |