Hey, we are using standalone Flink on Kubernetes and have followed the instructions in "Kubernetes HA Services" (https://ci.apache.org/projects/flink/flink-docs-stable/deployment/ha/kubernetes_ha.html), but we were unable to make it work and are facing a lot of problems. For example, some of the jobs don't start, complaining that there are not enough slots available, although there are enough slots; it seems the JobManager is NOT aware of all the TaskManagers. In another scenario we were unable to run any job at all: the Flink dashboard is unresponsive and we get the error "flink service temporarily unavailable due to an ongoing leader election. please refresh". We believe we are missing some configuration. Are there any more detailed instructions, or any suggestions/tips? Attached is the log of the JobManager from one of the attempts. Please give me some advice. BR, Danny

jobmanager_log (3).txt (279K)
Hi Daniel, what's the exact configuration you used? Did you use the resource definitions provided in the Standalone Flink on Kubernetes docs [1]? Did you do certain things differently in comparison to the documentation? Best, Matthias On Wed, Feb 10, 2021 at 1:31 PM Daniel Peled <[hidden email]> wrote:
One other thing: It looks like you've set high-availability.storageDir to a local path file:///opt/flink/recovery. You should use a storage path that is accessible from all Flink cluster components (e.g. using S3). Only references are stored in Kubernetes ConfigMaps [1]. On Wed, Feb 10, 2021 at 6:08 PM Matthias Pohl <[hidden email]> wrote:
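For reference, a minimal sketch of the HA-related flink-conf.yaml keys for standalone Kubernetes HA on Flink 1.12, as described in the linked docs; the cluster id and the S3 bucket/path below are placeholders, not values from Daniel's setup:

# HA keys in flink-conf.yaml (standalone Kubernetes HA, Flink 1.12.x)
kubernetes.cluster-id: my-flink-cluster    # any id that is unique per cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Must point to storage reachable by all JobManagers and TaskManagers
# (e.g. S3/HDFS/NFS), not a pod-local path like file:///opt/flink/recovery.
high-availability.storageDir: s3://my-bucket/flink/recovery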
I'm adding the Flink user ML to the conversation again. On Mon, Feb 15, 2021 at 8:18 AM Matthias Pohl <[hidden email]> wrote:
Hi Omer, I think Matthias is right. The K8s HA services create and edit config maps. Hence they need the rights to do this. In the native K8s documentation there is a section about how to create a service account with the right permissions [1]. I think that our K8s HA documentation currently lacks this part. I will create a PR to update the documentation. Cheers, Till On Mon, Feb 15, 2021 at 9:32 AM Matthias Pohl <[hidden email]> wrote:
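As a sketch of what such a service account could look like in manifest form (the name "flink" and the "default" namespace are placeholders; the referenced native K8s docs achieve the same effect with a ClusterRoleBinding to the built-in "edit" role):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: flink
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: flink-role-binding-flink
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit        # built-in role; broad, but sufficient for ConfigMap access
subjects:
  - kind: ServiceAccount
    name: flink
    namespace: default

The standalone JobManager/TaskManager deployments would then reference it via serviceAccountName: flink in their pod spec.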
If you are running a session cluster, then Flink will create a config map for every submitted job. These config maps will unfortunately only be cleaned up when you shut down the cluster. This is a known limitation which we want to fix soon [1, 2]. If you can help us with updating the documentation properly (e.g. which role binding to use for the service account with minimal permissions), then we would highly appreciate your help. Cheers, Till On Tue, Feb 16, 2021 at 3:45 PM Omer Ozery <[hidden email]> wrote:
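One possible answer to the "minimal permissions" question, offered as an assumption rather than something taken from the Flink docs: a namespaced Role restricted to the ConfigMap operations the Kubernetes HA services perform, bound to the "flink" service account (all names are placeholders):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha-configmaps
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-ha-configmaps
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flink-ha-configmaps
subjects:
  - kind: ServiceAccount
    name: flink
    namespace: default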
Hi Omer, could you share a bit more of the logs with us? I would be interested in what has happened before "Stopping DefaultLeaderRetrievalService" is logged. One problem you might run into is FLINK-20417. This problem should be fixed with Flink 1.12.2. Cheers, Till On Thu, Feb 18, 2021 at 2:54 PM Omer Ozery <[hidden email]> wrote:
Hi Omer, thanks for the logs. Could you tell us a bit more about the concrete setup of your Flink K8s cluster? It looks to me as if the ResourceManager cannot talk to the JobMaster that tries to register at the RM. Also, some JobMasters don't seem to reach the ResourceManager. Could it be that you are running standby JobManager processes? If that is the case, then using a K8s service for the communication between Flink components will not work. Cheers, Till On Sun, Feb 28, 2021 at 11:29 AM Omer Ozery <[hidden email]> wrote:
Hmm, this is strange. From the logs it looks as if certain communications between components don't arrive at the receiver's end. I think we have to dig further into the problem. To narrow it down, could you try to start the cluster using pod IPs instead of K8s services for inter-component communication? You can see how to configure this here [1]. That way we make sure that it is not a problem with the K8s service. Cheers, Till On Mon, Mar 1, 2021, 21:42 Omer Ozery <[hidden email]> wrote:
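A sketch of what that looks like in the JobManager container spec, based on the resource definitions in the standalone Kubernetes HA docs; the image tag is an assumption:

spec:
  containers:
    - name: jobmanager
      image: apache/flink:1.12.2-scala_2.11   # placeholder tag
      env:
        # Expose the pod's own IP via the downward API.
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
      # Overwrites jobmanager.rpc.address from the ConfigMap with the pod IP,
      # so the components address this pod directly instead of a K8s service.
      args: ["jobmanager", "$(POD_IP)"]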
|