Thanks for the reply. Well, tracing back to the root cause, I see the following:
1. On the JobManager, checkpoint times are getting progressively worse:
2017-09-16 05:05:50,813 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1 @ 1505538350809
2017-09-16 05:05:51,396 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1 (11101233 bytes in 586 ms).
2017-09-16 05:07:30,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 2 @ 1505538450809
2017-09-16 05:07:31,657 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 (18070955 bytes in 583 ms).
...
2017-09-16 07:32:58,117 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 89 (246125113 bytes in 27194 ms).
2017-09-16 07:34:10,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 90 @ 1505547250809
2017-09-16 07:34:44,932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 90 (248272325 bytes in 34012 ms).
2017-09-16 07:35:50,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 91 @ 1505547350809
2017-09-16 07:36:37,058 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 91 (250348812 bytes in 46136 ms).
2017-09-16 07:37:30,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 92 @ 1505547450809
2017-09-16 07:38:18,076 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 92 (252399724 bytes in 47152 ms).
2017-09-16 07:39:10,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 93 @ 1505547550809
2017-09-16 07:40:13,494 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 93 (254374636 bytes in 62573 ms).
2017-09-16 07:40:50,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 94 @ 1505547650809
2017-09-16 07:42:42,850 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 94 (256386533 bytes in 111898 ms).
2017-09-16 07:42:42,850 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 95 @ 1505547762850
2017-09-16 07:46:06,241 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 95 (258441766 bytes in 203268 ms).
2017-09-16 07:46:06,241 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 96 @ 1505547966241
2017-09-16 07:48:42,069 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - KeyedCEPPatternOperator -> Map (1/4) (ff835faa9eb9182ed2f2230a1e5cc56d) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 96 for operator KeyedCEPPatternOperator -> Map (1/4).}
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 96 for operator KeyedCEPPatternOperator -> Map (1/4).
... 6 more
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
... 5 more
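So it looks like the job ran out of memory: the KeyedCEPPatternOperator -> Map (1/4) task failed with "OutOfMemoryError: GC overhead limit exceeded" while materializing checkpoint 96, apparently because of the progressively worsening checkpoints. Any ideas on how to make the checkpoints faster?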
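In case it helps to quantify the trend, here is a minimal Python sketch that pulls the checkpoint id, size, and duration out of the "Completed checkpoint" lines above. The regex is written against the log format shown in this post only, and the function names (parse_completed, throughput_mb_per_s) are my own helpers, not anything provided by Flink.

```python
import re

# Matches the CheckpointCoordinator "Completed checkpoint" lines as they
# appear in the log excerpt above (assumed format, not a Flink API).
COMPLETED = re.compile(r"Completed checkpoint (\d+) \((\d+) bytes in (\d+) ms\)")


def parse_completed(lines):
    """Return a list of (checkpoint_id, size_bytes, duration_ms) tuples."""
    rows = []
    for line in lines:
        m = COMPLETED.search(line)
        if m:
            rows.append(tuple(int(g) for g in m.groups()))
    return rows


def throughput_mb_per_s(size_bytes, duration_ms):
    """Effective checkpoint rate in MB/s, taking 1 MB as 10**6 bytes."""
    return (size_bytes / 1e6) / (duration_ms / 1e3)


if __name__ == "__main__":
    # Three lines copied from the log excerpt above.
    sample = [
        "2017-09-16 05:05:51,396 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1 (11101233 bytes in 586 ms).",
        "2017-09-16 07:34:44,932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 90 (248272325 bytes in 34012 ms).",
        "2017-09-16 07:46:06,241 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 95 (258441766 bytes in 203268 ms).",
    ]
    for cp_id, size, ms in parse_completed(sample):
        rate = throughput_mb_per_s(size, ms)
        print(f"checkpoint {cp_id}: {size / 1e6:.1f} MB in {ms / 1e3:.1f} s ({rate:.2f} MB/s)")
```

Going by the lines above, the checkpoint size grows about 23x between checkpoint 1 and checkpoint 95 (11101233 to 258441766 bytes), while the duration grows about 350x (586 ms to 203268 ms). So the slowdown is far steeper than the state growth alone would suggest, which might be consistent with the GC pressure seen in the stack trace.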