I've had similar problems when running Flink on YARN. A Flink TaskManager fails, the jobs can't be restarted because there aren't enough slots, and eventually YARN decides to terminate Flink. You then lose all your jobs and state because Flink regards it as a graceful shutdown. My latest attempt to solve the issue was to disable the vmem and pmem checks in YARN with the "yarn.nodemanager.pmem-check-enabled" and "yarn.nodemanager.vmem-check-enabled" settings. It's been OK so far, but I'm not totally sure whether it was a good idea.
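For reference, here is a rough sketch of how those two settings would look in yarn-site.xml. I'm assuming they are set on every NodeManager host and that the NodeManagers are restarted afterward. Both properties default to true, and setting them to false stops YARN from killing containers that exceed their physical or virtual memory limits, so treat this as a workaround and not a real fix.

```xml
<!-- yarn-site.xml fragment (assumed to be applied on each NodeManager) -->
<property>
  <!-- Do not kill containers that exceed their physical memory limit -->
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <!-- Do not kill containers that exceed their virtual memory limit -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```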
Of course, I don't know whether that's the same problem you're having, since I can't tell whether you're running Flink on YARN.
-Shannon
On 4/14/17, 2:55 AM, "sohimankotia" <[hidden email]> wrote: