Hi all,

I am upgrading my DataSet jobs from Flink 1.8 to 1.12. After the upgrade I started to receive errors like this one:

14:12:57,441 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager - Worker container_e120_1608377880203_0751_01_000112 is terminated. Diagnostics: Resource hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar changed on src filesystem (expected 1610892446439, was 1610892446971
java.io.IOException: Resource hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar changed on src filesystem (expected 1610892446439, was 1610892446971
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:257)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:228)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:221)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:209)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I understand it is somehow related to FLINK-12195, but this time the error comes from the Hadoop code (a simplified sketch of the failing check is in the P.S. below). I am running a very old version of the HDP platform (2.6.5), so it might be the one to blame. But the code was working perfectly fine before the upgrade, so I am confused.

Could you please advise?

Thank you!

Mark
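P.S. As far as I can tell from the stack trace, the failing check lives in Hadoop's FSDownload: YARN records the modification time of each shipped resource when the application is submitted and refuses to localize the file if its timestamp in HDFS no longer matches. A simplified sketch of that check (not the literal Hadoop code, the names here are just for illustration):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class ResourceTimestampCheck {

        // 'expectedTimestamp' is the modification time recorded for the resource
        // when the application was submitted.
        static void verifyUnchanged(FileSystem fs, Path resource, long expectedTimestamp)
                throws IOException {
            FileStatus stat = fs.getFileStatus(resource);
            if (stat.getModificationTime() != expectedTimestamp) {
                throw new IOException("Resource " + resource + " changed on src filesystem"
                        + " (expected " + expectedTimestamp
                        + ", was " + stat.getModificationTime() + ")");
            }
        }
    }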
Hi Mark,

Two quick questions that might help us understand what's going on.

- Does this error happen for every one of your DataSet jobs? And for a problematic job, does it happen for every container?
- What is the `jobs.jar`? Is it under `lib/` or `opt/` of your client-side filesystem, or specified via `yarn.ship-files`, `yarn.ship-archives` or `yarn.provided.lib.dirs`? This will help us locate the code path this file went through.

Thank you~

Xintong Song

On Sun, Jan 17, 2021 at 10:32 PM Mark Davis <[hidden email]> wrote:
It would also help if you could send us the DEBUG logs of the run, Mark, including the logs from the client, because they contain information about which timestamp is used for the upload.

One more question that could help pinpoint the problem: does the problem start occurring with Flink 1.10.0? My suspicion is that we might have broken something with the second PR for FLINK-8801 [1]. It looks like we no longer try to set the local timestamp via FileSystem.setTimes if we cannot fetch the remote timestamp. However, this should only be a problem for eventually consistent filesystems.

On Mon, Jan 18, 2021 at 11:04 AM Xintong Song <[hidden email]> wrote:
Hi Xintong Song,
I finally found the cause of the problem - I set both yarn.flink-dist-jar and pipeline.jars to the same archive (I submit jobs programmatically and repackage the Flink distribution, because flink-dist.jar is not in Maven Central). If I copy the file and refer to the job and distribution jars under different names, the problem disappears.

My guess is that YARN (YarnApplicationFileUploader?) copies both files, and if the file names are the same the first file is overwritten by the second one, hence the timestamp difference. I guess a lot has changed since 1.8 in the YARN deployment area. Too bad there is no clear instruction on how to submit a job programmatically - every time I have to reverse engineer CliFrontend. I have put a rough sketch of what my submitter now sets in the P.S. below.

Sorry for the confusion and thanks!

Mark
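P.S. In case it helps someone hitting the same error - this is roughly what the relevant part of the configuration looks like now. The paths and the class name are placeholders and the rest of the submission code is omitted; the only point is that the two options no longer reference the same file name:

    import java.util.Collections;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.PipelineOptions;

    public class SubmitConfigSketch {

        public static Configuration buildConfig() {
            Configuration conf = new Configuration();

            // Repackaged distribution jar (flink-dist.jar is not in Maven Central).
            conf.setString("yarn.flink-dist-jar",
                    "file:///opt/myapp/flink-dist-repackaged.jar");

            // User-code jar, shipped under a *different* file name so that the
            // YARN uploader does not overwrite one upload with the other.
            conf.set(PipelineOptions.JARS,
                    Collections.singletonList("file:///opt/myapp/jobs.jar"));

            return conf;
        }
    }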