Failed job restart - flink on yarn
Posted by vprabhu@gmail.com on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Failed-job-restart-flink-on-yarn-tp7764.html
Hi,
I have a Flink streaming job that reads from Kafka and performs an aggregation in a window. It ran fine for a while, but when the number of events in a window crossed a certain limit, the YARN containers failed with an out-of-memory error. The job was running with 10G containers.
We have about 64G of memory on the machine, and I now want to restart the job with a 20G container (we ran some tests, and 20G should be enough to accommodate all the elements in the window).
Is there a way to restart the job from the last checkpoint?
When I resubmit the job, it starts from the last committed offsets, but the events that were held in the window at the time of checkpointing seem to be lost. Is there a way to recover the events that were buffered in the window and checkpointed before the failure?
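To make the symptom concrete, here is a toy Python model of what I think is happening. It is not Flink code: the `Checkpoint` class, the `run` helper, and the event values are all made up for illustration, and it assumes the checkpoint holds both the source offset and the window buffer.

```python
# Toy model only, not Flink code. All names below are hypothetical.
from dataclasses import dataclass


@dataclass
class Checkpoint:
    offset: int          # next source offset to read after the checkpoint
    window_buffer: list  # events held in the open window at checkpoint time


def run(events, start_offset, buffer, stop_at):
    """Read events[start_offset:stop_at] into a copy of the window buffer."""
    buffer = list(buffer)
    for i in range(start_offset, stop_at):
        buffer.append(events[i])
    return buffer


events = list(range(10))  # ten events at offsets 0..9

# The job reads offsets 0..5, then a checkpoint is taken.
buffered = run(events, 0, [], 6)
checkpoint = Checkpoint(offset=6, window_buffer=buffered)

# Case A: resubmit as a fresh job. Offsets resume, window state starts empty.
resubmitted = run(events, checkpoint.offset, [], 10)

# Case B: restore the checkpointed window state together with the offset.
restored = run(events, checkpoint.offset, checkpoint.window_buffer, 10)

print(sum(resubmitted))  # only events 6..9 reach the window result
print(sum(restored))     # all events 0..9 reach the window result
```

Case A matches what I observe after resubmitting; Case B is the behaviour I am hoping to get.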
Thanks,
Prabhu