Hello Flink, I have some questions regarding the guidelines on configuring the restart strategy.
I was testing a job with the following setup:
When a TM got removed by k8s, it looked like that caused multiple failures to happen all at once. In the job manager log, I'm seeing different tasks fail with the same stacktrace 'Heartbeat of taskManager with id {SOME_ID} timed out' around the same time.
I understand that all the tasks that were running on this taskManager would fail, but I still have the following questions:
Questions:
Thank you so much!
Jiahui
1) A restart in one region only
increments the count by 1, independent of how many tasks from that
region fail at the same time.
If tasks from different regions fail at
the same time, then the count is incremented by the number of
affected regions.
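
To make the counting rule concrete, here is a minimal Python sketch. It is only a model of the behaviour described above, not Flink's implementation; the function name and the task and region names are made up for illustration.

```python
def restart_count_increment(failed_tasks, region_of):
    """Number of restarts counted for a set of tasks failing at the same time.

    failed_tasks: iterable of task names (hypothetical)
    region_of: dict mapping task name -> region id (hypothetical)
    """
    # Each affected region counts once, no matter how many of its tasks failed.
    return len({region_of[t] for t in failed_tasks})


region_of = {"map-1": "r1", "sink-1": "r1", "map-2": "r2", "sink-2": "r2"}

# Two tasks of the same region fail together: counted as one restart.
same_region = restart_count_increment(["map-1", "sink-1"], region_of)

# A lost TaskManager hosted tasks of two regions: counted as two restarts.
two_regions = restart_count_increment(["map-1", "map-2"], region_of)
```

So a single TaskManager loss can consume several units of the restart budget if its slots ran tasks from several regions.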
2)
I would consider what failure rate would be acceptable if there were no regions, and then multiply that by the number of slots to offset task executor failures.
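
As a rough illustration of that rule of thumb, the arithmetic could look like the sketch below. The numbers are invented, the function name is hypothetical, and it assumes the worst case in which every slot of a lost task executor ran a different region.

```python
def scaled_max_failures(acceptable_without_regions, slots_per_task_executor):
    """Scale a region-agnostic failure budget for region-based counting."""
    # Losing one task executor can fail up to one region per slot at once,
    # so the budget is multiplied by the number of slots.
    return acceptable_without_regions * slots_per_task_executor


# Example: 3 failures per interval would be acceptable without regions,
# and each task executor offers 4 slots (both values are assumptions).
max_failures = scaled_max_failures(3, 4)
```

With a failure-rate restart strategy, the resulting value is the kind of number one might use for `restart-strategy.failure-rate.max-failures-per-interval`.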
Failures in the application (e.g., a source failing for some
reason) will generally behave, failure-rate wise, as if regions
did not exist. They are sporadic, and the chance of them
appearing in different regions at the same time seems rather
small.

On 15/07/2020 00:16, Jiahui Jiang wrote: