Hi,

I am looking into why YARN starts a new application attempt for Flink 1.5.2. The challenge is getting the logs for the first attempt. After checking YARN I discovered that the application master (job manager) gets assigned the same container id in both the first and the second attempt (is this expected?). In that case, are the logs from the first attempt overwritten? I found that setKeepContainersAcrossApplicationAttempts is enabled here.

The second challenge is understanding whether the job will be restored in the new application attempt, or whether the new attempt will just have Flink running without any job.

Regards,
Pawel

First attempt:

[pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000001
18/10/08 10:16:16 INFO client.RMProxy: Connecting to ResourceManager at ip-10-4-X-X.eu-west-1.compute.internal/10.4.108.26:8032
Total number of containers: 1
Container-Id:      container_1538570922803_0020_02_000001
Start Time:        Mon Oct 08 09:47:17 +0000 2018
Finish Time:       N/A
State:             RUNNING
Host:              ip-10-4-X-X.eu-west-1.compute.internal:8041
Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
LOG-URL:           http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek

Second attempt:

[pawel_bartoszek@ip-10-4-X-X ~]$ yarn container -list appattempt_1538570922803_0020_000002
18/10/08 10:16:37 INFO client.RMProxy: Connecting to ResourceManager at ip-10-4-X-X.eu-west-1.compute.internal/10.4.X.X:8032
Total number of containers: 1
Container-Id:      container_1538570922803_0020_02_000001
Start Time:        Mon Oct 08 09:47:17 +0000 2018
Finish Time:       N/A
State:             RUNNING
Host:              ip-10-4-X-X.eu-west-1.compute.internal:8041
Node Http Address: http://ip-10-4-X-X.eu-west-1.compute.internal:8042
LOG-URL:           http://ip-10-4-X-X.eu-west-1.compute.internal:8042/node/containerlogs/container_1538570922803_0020_02_000001/pawel_bartoszek
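If YARN log aggregation is enabled on the cluster, the aggregated logs can also be pulled from the command line; the -containerId flag narrows the output to a single container. A minimal sketch, assuming the application id from the listings above:

    # All logs for the application, across attempts and containers:
    yarn logs -applicationId application_1538570922803_0020

    # Only the container listed above; if both attempts really map to the
    # same container id, there is only one log directory to inspect here:
    yarn logs -applicationId application_1538570922803_0020 \
        -containerId container_1538570922803_0020_02_000001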
Hi Pawel,

As far as I know, the application attempt is incremented if the application master fails and a new one is brought up. Therefore, what you are seeing should not happen. I have just deployed on AWS EMR 5.17.0 (Hadoop 2.8.4) and killed the container running the application master – the container id was not reused.

Can you describe how to reproduce this behavior? Do you have a sample application? Can you observe this behavior consistently? Can you share the complete output of yarn logs -applicationId <YOUR_APPLICATION_ID>?

The call to the method setKeepContainersAcrossApplicationAttempts is needed to enable recovery of previously allocated TaskManager containers [1]. I currently do not see how it is possible to keep the AM container across application attempts.

> The second challenge is understanding if the job will be restored into new
> application attempts or new application attempt will just have flink running
> without any job?

The job will be restored if you have HA enabled [2][3].

Best,
Gary

[1] https://hortonworks.com/blog/apache-hadoop-yarn-hdp-2-2-fault-tolerance-features-long-running-services/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/jobmanager_high_availability.html#yarn-cluster-high-availability
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/yarn_setup.html#recovery-behavior-of-flink-on-yarn

On Mon, Oct 8, 2018 at 12:32 PM Pawel Bartoszek <[hidden email]> wrote:
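For reference, a minimal flink-conf.yaml sketch of the ZooKeeper-based HA setup described in [2][3]; the quorum addresses, storage path, and attempt count below are placeholders to adapt to your own cluster:

    high-availability: zookeeper
    # Placeholder quorum; point this at your own ZooKeeper ensemble.
    high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181
    # Durable storage for JobManager metadata (job graphs, checkpoint pointers).
    high-availability.storageDir: hdfs:///flink/ha/
    # Allow YARN to restart the application master this many times.
    yarn.application-attempts: 10

With these settings, a new application attempt locates the JobManager metadata via ZooKeeper and reads it back from the storage directory, so the running jobs are recovered instead of the new attempt starting an empty Flink cluster.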
|