Hi all,

I am upgrading my DataSet jobs from Flink 1.8 to 1.12. After the upgrade I started to receive errors like this one:

14:12:57,441 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager - Worker container_e120_1608377880203_0751_01_000112 is terminated. Diagnostics: Resource hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar changed on src filesystem (expected 1610892446439, was 1610892446971
java.io.IOException: Resource hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar changed on src filesystem (expected 1610892446439, was 1610892446971
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:257)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:228)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:221)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:209)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I understand it is somehow related to FLINK-12195, but this time the error comes from the Hadoop code (a simplified sketch of the failing check is in the P.S. below). I am running a very old version of the HDP platform (2.6.5), so it might be the one to blame. But the code was working perfectly fine before the upgrade, so I am confused.

Could you please advise?

Thank you!

Mark
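P.S. As far as I can tell from the stack trace, the failing check lives in Hadoop's FSDownload: YARN records the modification time of each shipped resource when the application is submitted and refuses to localize the file if its timestamp in HDFS no longer matches. A simplified sketch of that check (not the literal Hadoop code, the names here are just for illustration):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class ResourceTimestampCheck {

        // 'expectedTimestamp' is the modification time recorded for the resource
        // when the application was submitted.
        static void verifyUnchanged(FileSystem fs, Path resource, long expectedTimestamp)
                throws IOException {
            FileStatus stat = fs.getFileStatus(resource);
            if (stat.getModificationTime() != expectedTimestamp) {
                throw new IOException("Resource " + resource + " changed on src filesystem"
                        + " (expected " + expectedTimestamp
                        + ", was " + stat.getModificationTime() + ")");
            }
        }
    }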
Hi Mark,

Two quick questions that might help us understand what's going on.

- Does this error happen for every one of your DataSet jobs? And for a problematic job, does it happen for every container?
- What is the `jobs.jar`? Is it under `lib/` or `opt/` of your client-side filesystem, or specified via `yarn.ship-files`, `yarn.ship-archives` or `yarn.provided.lib.dirs`? This will help us locate the code path this file went through.

Thank you~

Xintong Song

On Sun, Jan 17, 2021 at 10:32 PM Mark Davis <[hidden email]> wrote:
It would also help if you could send us the DEBUG logs of the run, Mark, including the logs from the client, because they contain information about which timestamp is used for the upload.

One more question that could help pinpoint the problem: does the problem start occurring with Flink 1.10.0? My suspicion is that we might have broken something with the second PR for FLINK-8801 [1]. It looks like we no longer try to set the local timestamp via FileSystem.setTimes if we cannot fetch the remote timestamp. However, this should only be a problem for eventually consistent filesystems.

On Mon, Jan 18, 2021 at 11:04 AM Xintong Song <[hidden email]> wrote:
Hi Xintong Song,
I finally found the cause of the problem - I set both yarn.flink-dist-jar and pipeline.jars to the same archive (I submit jobs programmatically and repackage the Flink distribution, because flink-dist.jar is not in Maven Central). If I copy the file and refer to the job and distribution jars under different names, the problem disappears.

My guess is that YARN (YarnApplicationFileUploader?) copies both files, and if the file names are the same the first file is overwritten by the second one, hence the timestamp difference. I guess a lot has changed since 1.8 in the YARN deployment area. Too bad there is no clear instruction on how to submit a job programmatically - every time I have to reverse engineer CliFrontend. I have put a rough sketch of what my submitter now sets in the P.S. below.

Sorry for the confusion and thanks!

Mark
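P.S. In case it helps someone hitting the same error - this is roughly what the relevant part of the configuration looks like now. The paths and the class name are placeholders and the rest of the submission code is omitted; the only point is that the two options no longer reference the same file name:

    import java.util.Collections;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.PipelineOptions;

    public class SubmitConfigSketch {

        public static Configuration buildConfig() {
            Configuration conf = new Configuration();

            // Repackaged distribution jar (flink-dist.jar is not in Maven Central).
            conf.setString("yarn.flink-dist-jar",
                    "file:///opt/myapp/flink-dist-repackaged.jar");

            // User-code jar, shipped under a *different* file name so that the
            // YARN uploader does not overwrite one upload with the other.
            conf.set(PipelineOptions.JARS,
                    Collections.singletonList("file:///opt/myapp/jobs.jar"));

            return conf;
        }
    }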