Jobs running on a YARN per-job cluster fail to restart when a task manager is lost

杨力
Hi,

I am running a streaming job without checkpointing enabled. A failure-rate restart strategy has been set with StreamExecutionEnvironment.setRestartStrategy.
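For reference, a failure-rate strategy is normally configured with env.setRestartStrategy(RestartStrategies.failureRateRestart(failureRate, failureInterval, delayInterval)). To make its semantics concrete, the sketch below is a simplified, self-contained model of the policy. It is not Flink's FailureRateRestartStrategy source: the class FailureRateModel and the example limit of 3 failures per 5 minutes are hypothetical, since the post does not give the actual values. The model only decides whether another restart is permitted. It ignores the restart delay and slot availability, so it cannot reproduce the NoResourceAvailableException itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of a failure-rate restart policy.
// NOT Flink's implementation: it ignores the restart delay and slot availability.
public class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    // Timestamps (ms) of failures still inside the sliding window, oldest first.
    private final Deque<Long> failures = new ArrayDeque<>();

    public FailureRateModel(int maxFailuresPerInterval, long intervalMillis) {
        if (maxFailuresPerInterval < 1 || intervalMillis <= 0) {
            throw new IllegalArgumentException(
                "maxFailuresPerInterval must be >= 1 and intervalMillis > 0");
        }
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    // Records a failure at nowMillis. Returns true if another restart is allowed,
    // false if the job should fail terminally.
    public boolean onFailure(long nowMillis) {
        failures.addLast(nowMillis);
        // Forget failures that are more than intervalMillis old.
        // The loop always stops at the entry just added (age 0).
        while (nowMillis - failures.peekFirst() > intervalMillis) {
            failures.removeFirst();
        }
        return failures.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Hypothetical limit: at most 3 failures within 5 minutes.
        FailureRateModel policy = new FailureRateModel(3, 5 * 60_000L);
        long[] failureTimes = {0L, 60_000L, 120_000L, 180_000L};
        for (long t : failureTimes) {
            System.out.println("failure at " + t + " ms -> restart allowed: " + policy.onFailure(t));
        }
    }
}
```

Running main reports that a restart is allowed for the first three failures and refused for the fourth, because all four fall within the same five-minute window. In the situation described here, the restart is permitted by the policy and then fails on missing slots, which this model deliberately does not cover.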

When a task manager is lost because of memory problems, the job manager tries to restart the job without launching a new task manager, and the restart fails with NoResourceAvailableException: Not enough slots available to run the job.

The job is running on flink 1.4.2 and Hadoop 2.7.4.