Completed job wasn't saved to archive

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Completed job wasn't saved to archive

Pavel Potseluev
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
Reply | Threaded
Open this post in threaded view
|

Re: Completed job wasn't saved to archive

vino yang
If everything is OK(your config options about archive dir and history server is correct), Flink should archive the completed job.

You said you did not find any exceptions in the log about failing to archive. But any other exceptions? Can you share the logs about your scene?

Best,
Vino

Pavel Potseluev <[hidden email]> 于2019年11月21日周四 上午2:25写道:
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
Reply | Threaded
Open this post in threaded view
|

Re: Completed job wasn't saved to archive

Pavel Potseluev
Hi Vino,
 
Usually Flink archives jobs correctly and the problem is rarely reproduced. So I think it isn't a problem with configuration.
 
Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1872 @ 1574092333218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260 [flink-akka.actor.default-dispatcher-30] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes in 23541 ms).

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1873 @ 1574092393218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to CANCELLING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [cluster-io-thread-6] INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint  - Checkpoint with ID 1872 at 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872' not discarded.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

 
21.11.2019, 06:07, "vino yang" <[hidden email]>:
If everything is OK(your config options about archive dir and history server is correct), Flink should archive the completed job.
 
You said you did not find any exceptions in the log about failing to archive. But any other exceptions? Can you share the logs about your scene?
 
Best,
Vino
 
Pavel Potseluev <[hidden email]> 于2019年11月21日周四 上午2:25写道:
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
Reply | Threaded
Open this post in threaded view
|

Re: Completed job wasn't saved to archive

Chesnay Schepler
If the archiving fails there should be some log message, like "Failed to archive job" or "Could not archive completed job..." .
If nothing of the sort is logged my first instinct would be that the operation is being slowed down, _a lot_.

Where are you archiving them to? Could it be the write operation is being throttled heavily?

On 21/11/2019 13:48, Pavel Potseluev wrote:
Hi Vino,
 
Usually Flink archives jobs correctly and the problem is rarely reproduced. So I think it isn't a problem with configuration.
 
Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1872 @ 1574092333218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260 [flink-akka.actor.default-dispatcher-30] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes in 23541 ms).

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1873 @ 1574092393218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to CANCELLING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [cluster-io-thread-6] INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint  - Checkpoint with ID 1872 at 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872' not discarded.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

 
21.11.2019, 06:07, "vino yang" [hidden email]:
If everything is OK(your config options about archive dir and history server is correct), Flink should archive the completed job.
 
You said you did not find any exceptions in the log about failing to archive. But any other exceptions? Can you share the logs about your scene?
 
Best,
Vino
 
Pavel Potseluev <[hidden email]> 于2019年11月21日周四 上午2:25写道:
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 


Reply | Threaded
Open this post in threaded view
|

Re: Completed job wasn't saved to archive

Pavel Potseluev
Hi Chesnay,
 
We archive jobs on s3 file system. We don't configure a throttling for write operations and afaik it isn't possible now and will be implemented in FLINK-13251. And other write operations (like checkpoints saving) work fine. But I don't see archived job or message about archiving failure at all. It looks like Flink just didn't try to save job to archive.
 
21.11.2019, 17:17, "Chesnay Schepler" <[hidden email]>:
If the archiving fails there should be some log message, like "Failed to archive job" or "Could not archive completed job..." .
If nothing of the sort is logged my first instinct would be that the operation is being slowed down, _a lot_.
 
Where are you archiving them to? Could it be the write operation is being throttled heavily?
 
On 21/11/2019 13:48, Pavel Potseluev wrote:
Hi Vino,
 
Usually Flink archives jobs correctly and the problem is rarely reproduced. So I think it isn't a problem with configuration.
 
Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1872 @ 1574092333218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260 [flink-akka.actor.default-dispatcher-30] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes in 23541 ms).

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1873 @ 1574092393218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to CANCELLING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [cluster-io-thread-6] INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint  - Checkpoint with ID 1872 at 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872' not discarded.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

 
21.11.2019, 06:07, "vino yang" [hidden email]:
If everything is OK(your config options about archive dir and history server is correct), Flink should archive the completed job.
 
You said you did not find any exceptions in the log about failing to archive. But any other exceptions? Can you share the logs about your scene?
 
Best,
Vino
 
Pavel Potseluev <[hidden email]> 于2019年11月21日周四 上午2:25写道:
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 

 

 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
Reply | Threaded
Open this post in threaded view
|

Re: Completed job wasn't saved to archive

Chesnay Schepler
I'm afraid I can't think of a solution. I don't see a way how this operation can succeed or fail without anything being logged.

Is the cluster behaving normally afterwards? Could you check whether the numRunningJobs ticks down properly after the job was canceled?


On 22/11/2019 13:27, Pavel Potseluev wrote:
Hi Chesnay,
 
We archive jobs on s3 file system. We don't configure a throttling for write operations and afaik it isn't possible now and will be implemented in FLINK-13251. And other write operations (like checkpoints saving) work fine. But I don't see archived job or message about archiving failure at all. It looks like Flink just didn't try to save job to archive.
 
21.11.2019, 17:17, "Chesnay Schepler" [hidden email]:
If the archiving fails there should be some log message, like "Failed to archive job" or "Could not archive completed job..." .
If nothing of the sort is logged my first instinct would be that the operation is being slowed down, _a lot_.
 
Where are you archiving them to? Could it be the write operation is being throttled heavily?
 
On 21/11/2019 13:48, Pavel Potseluev wrote:
Hi Vino,
 
Usually Flink archives jobs correctly and the problem is rarely reproduced. So I think it isn't a problem with configuration.
 
Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1872 @ 1574092333218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260 [flink-akka.actor.default-dispatcher-30] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes in 23541 ms).

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1873 @ 1574092393218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to CANCELLING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [cluster-io-thread-6] INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint  - Checkpoint with ID 1872 at 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872' not discarded.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

 
21.11.2019, 06:07, "vino yang" [hidden email]:
If everything is OK(your config options about archive dir and history server is correct), Flink should archive the completed job.
 
You said you did not find any exceptions in the log about failing to archive. But any other exceptions? Can you share the logs about your scene?
 
Best,
Vino
 
Pavel Potseluev <[hidden email]> 于2019年11月21日周四 上午2:25写道:
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 

 

 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 


Reply | Threaded
Open this post in threaded view
|

Re: Completed job wasn't saved to archive

Rong Rong
Hi Pavel,

Sorry for bringing this thread up so late. I was digging into the usage of the Flink history server and I found one situation where there would be no logs and no failure/success message from the cluster:
In very rare case in our Flink-YARN session cluster: if an application master (JobManager running container) fails and being restarted as a YARN 2nd attempt (we haven't enable HA) - then there will be no logs of archiving being logged whatsoever. However in this case the there would be a completely new AM container brought up running the JM again (e.g. new log files)

I am not exactly sure whether this suites your scenarios. Could you describe a bit more on how your cluster was configured?

Thanks,
Rong

On Mon, Nov 25, 2019 at 10:49 AM Chesnay Schepler <[hidden email]> wrote:
I'm afraid I can't think of a solution. I don't see a way how this operation can succeed or fail without anything being logged.

Is the cluster behaving normally afterwards? Could you check whether the numRunningJobs ticks down properly after the job was canceled?


On 22/11/2019 13:27, Pavel Potseluev wrote:
Hi Chesnay,
 
We archive jobs on s3 file system. We don't configure a throttling for write operations and afaik it isn't possible now and will be implemented in FLINK-13251. And other write operations (like checkpoints saving) work fine. But I don't see archived job or message about archiving failure at all. It looks like Flink just didn't try to save job to archive.
 
21.11.2019, 17:17, "Chesnay Schepler" [hidden email]:
If the archiving fails there should be some log message, like "Failed to archive job" or "Could not archive completed job..." .
If nothing of the sort is logged my first instinct would be that the operation is being slowed down, _a lot_.
 
Where are you archiving them to? Could it be the write operation is being throttled heavily?
 
On 21/11/2019 13:48, Pavel Potseluev wrote:
Hi Vino,
 
Usually Flink archives jobs correctly and the problem is rarely reproduced. So I think it isn't a problem with configuration.
 
Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1872 @ 1574092333218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260 [flink-akka.actor.default-dispatcher-30] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes in 23541 ms).

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 1873 @ 1574092393218 for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to CANCELLING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (2/2) (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - user-offers-statistics-init-from-file -> Map (1/2) (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO  o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 [cluster-io-thread-6] INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint  - Checkpoint with ID 1872 at 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872' not discarded.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045 [flink-akka.actor.default-dispatcher-2] INFO  o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED.

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

 
21.11.2019, 06:07, "vino yang" [hidden email]:
If everything is OK(your config options about archive dir and history server is correct), Flink should archive the completed job.
 
You said you did not find any exceptions in the log about failing to archive. But any other exceptions? Can you share the logs about your scene?
 
Best,
Vino
 
Pavel Potseluev <[hidden email]> 于2019年11月21日周四 上午2:25写道:
Hi all,
 
We see occasionally that flink doesn't save information about canceled job to archive directory (configured by jobmanager.archive.fs.dir property). And there are no exceptions in the log about failing archiving. It's a problem in our use case because our script for deploying jobs relies on flink history server to find latest checkpoint for some job. Does flink guarantee saving data to archive? If so, any ideas why it doesn't work sometimes? Flink version is 1.8.0.
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 
 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC
 

 

 
 
-- 
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC