When this happens, one of the workers appears to fail while the rest continue to run. How can I configure the app to recover completely from the last successful checkpoint when this happens?
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, December 3, 2018 11:02 AM, Flink Developer <[hidden email]> wrote:
I have a Flink app on 1.5.2 which reads from a Kafka topic (400 partitions) and runs with a parallelism of 400. The job uses a bucketing sink to S3 with RocksDB. The checkpoint interval is 2 min and the checkpoint timeout is 2 min. The checkpoint size is a few MB. After running for a few days, I see:
org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover strategy - falling back to global restart
java.lang.ClassCastException: com.amazonaws.services.s3.model.AmazonS3Exception cannot be cast to com.amazonaws.AmazonClientException
at org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
at org.apache.flink.util.SerializedThrowable
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
What causes the exception, and why is the Flink job unable to recover? The log says it is falling back to global restart. How can this be configured to recover properly? Is the checkpoint interval/timeout too low? The Flink job's configuration shows: Restart with fixed delay (0ms), #2147483647 restart attempts.
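
For reference, the checkpoint and restart settings described above would look roughly like this if set programmatically. This is only a sketch: the class name CheckpointSettingsSketch and the method configure are made up for illustration, the values are the ones quoted in this message, and the actual job may set some of them elsewhere (for example in flink-conf.yaml). It needs the Flink 1.5.x streaming dependency on the classpath.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class, used only to illustrate the settings quoted above.
public class CheckpointSettingsSketch {

    // Applies the checkpoint and restart settings described in the message.
    static void configure(StreamExecutionEnvironment env) {
        // Checkpoint interval: 2 minutes, in milliseconds.
        env.enableCheckpointing(120_000L);

        // Checkpoint timeout: 2 minutes, the same as the interval.
        env.getCheckpointConfig().setCheckpointTimeout(120_000L);

        // "Restart with fixed delay (0ms), #2147483647 restart attempts":
        // Integer.MAX_VALUE attempts with no pause between them.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.milliseconds(0)));
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        configure(env);
        // The Kafka source, bucketing sink, and env.execute(...) are omitted here.
    }
}
```

With a delay of 0 ms, each restart attempt follows the previous failure immediately. A non-zero delay can be passed as the second argument of fixedDelayRestart, though I don't know whether that has any bearing on the exception above.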