TM heartbeat timeout due to ResourceManager being busy


Paul Lam
Hi,

Even with FLINK-13184 in place (and even on Flink 1.11), jobs with high parallelism still
occasionally hit TM-RM heartbeat timeouts when the RM is busy creating TM contexts during
cluster initialization and HDFS happens to be slow at that moment.

Apart from increasing the TM heartbeat timeout, is there any recommended out-of-the-box
approach that reduces the chance of hitting these timeouts?

In the long run, would it be possible to limit the number of TaskManager contexts that the RM
creates at a time, so that the heartbeat triggers get a chance to run in between?

Thanks!

Best,
Paul Lam
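
The batching idea in the last question can be illustrated with a minimal sketch. It is not Flink code: a single-threaded executor stands in for the RM's RPC main thread, and `MAX_PER_BATCH`, `createBatch`, and the `context-N` strings are hypothetical names made up for the illustration. Each batch creates a bounded number of contexts and then re-enqueues itself, so a heartbeat task that is already waiting in the queue runs before the next batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchedContextCreation {
    // Hypothetical limit on how many contexts one main-thread task may create.
    static final int MAX_PER_BATCH = 2;

    // Creates at most MAX_PER_BATCH contexts, then yields the main thread by
    // re-enqueueing the remaining work behind whatever is already queued.
    static void createBatch(ExecutorService mainThread, Queue<Integer> pending,
                            List<String> events, CompletableFuture<Void> done) {
        for (int n = 0; n < MAX_PER_BATCH && !pending.isEmpty(); n++) {
            events.add("context-" + pending.poll());
        }
        if (pending.isEmpty()) {
            done.complete(null);
        } else {
            mainThread.execute(() -> createBatch(mainThread, pending, events, done));
        }
    }

    /** Creates `total` contexts in batches and returns the order of events. */
    static List<String> run(int total) {
        // Stand-in for the RPC main thread: one thread, FIFO task queue.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        List<String> events = new ArrayList<>(); // only mutated on mainThread
        Queue<Integer> pending = new ArrayDeque<>();
        for (int i = 1; i <= total; i++) {
            pending.add(i);
        }
        CompletableFuture<Void> done = new CompletableFuture<>();
        try {
            mainThread.execute(() -> {
                // Queue the first batch, then a heartbeat that arrives right after it.
                mainThread.execute(() -> createBatch(mainThread, pending, events, done));
                mainThread.execute(() -> events.add("heartbeat"));
            });
            done.join(); // wait until the last batch has finished
            return new ArrayList<>(events);
        } finally {
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(5));
    }
}
```

With five contexts and a batch size of two, the heartbeat is handled after the first two contexts instead of after all five. The trade-off is that total creation time does not shrink; the work is only sliced so that other main-thread tasks are not starved.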
Re: TM heartbeat timeout due to ResourceManager being busy

Xintong Song
Hi Paul,

Thanks for reporting this.

Indeed, Flink's RM currently performs several HDFS operations in the RPC main thread when preparing the TM context, which can block that thread when HDFS is slow.

Unfortunately, I don't see any out-of-the-box approach that fixes the problem at the moment, other than increasing the heartbeat timeout.

As for the long-run solution, I think there's an easier approach: we can move the creation of the TM contexts off the RPC main thread. Ideally, any heavy operation that does not modify the RM's internal state should not run in the RPC main thread at all. With FLINK-19241, this can be achieved easily by delegating the work to the IO executor.

Thank you~

Xintong Song
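
The delegation pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual change tracked in the Flink tickets: plain `java.util.concurrent` executors stand in for the RPC main thread and the IO executor, and `prepareContext` is a hypothetical placeholder for the slow HDFS work. The blocking call runs on the IO pool, and only the state update hops back to the main thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadSketch {
    // Hypothetical stand-in for the slow HDFS operations done while preparing a TM context.
    static String prepareContext(int id) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "tm-context-" + id;
    }

    /** Runs the demo and returns the order of events handled on the main thread. */
    static List<String> run() {
        // Single thread playing the role of the RPC main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Separate pool playing the role of the IO executor.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        List<String> events = new ArrayList<>(); // only mutated on mainThread
        try {
            // Blocking work runs on ioExecutor; only the state update returns to mainThread.
            CompletableFuture<Void> started = CompletableFuture
                    .supplyAsync(() -> prepareContext(1), ioExecutor)
                    .thenAcceptAsync(ctx -> events.add("started " + ctx), mainThread);
            // A heartbeat arriving in the meantime does not wait for the slow call.
            CompletableFuture<Void> heartbeat =
                    CompletableFuture.runAsync(() -> events.add("heartbeat"), mainThread);
            CompletableFuture.allOf(started, heartbeat).join();
            return new ArrayList<>(events);
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the main thread never executes `prepareContext`, the heartbeat is recorded before the context is registered. Keeping all mutations of `events` on the single main thread mirrors the rule that internal state is only touched from the RPC main thread.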



(In reply to Paul Lam's message of Mon, Oct 12, 2020 at 12:44 PM.)
Re: TM heartbeat timeout due to ResourceManager being busy

Xintong Song
FYI, I just created FLINK-19568 for tracking this issue.


Thank you~

Xintong Song



(In reply to Xintong Song's message of Mon, Oct 12, 2020 at 2:18 PM.)
Re: TM heartbeat timeout due to ResourceManager being busy

Paul Lam
In reply to this post by Xintong Song
Hi Xingtong,

Thanks a lot for the pointer!

It's good to see that there will be a new IO executor to take care of the TM contexts. Looking forward to the 1.12 release!

Best,
Paul Lam


Re: TM heartbeat timeout due to ResourceManager being busy

Paul Lam
Sorry for the misspelled name, Xintong

Best,
Paul Lam

(In reply to Paul Lam's message of Oct 12, 2020 at 14:46.)


Re: TM heartbeat timeout due to ResourceManager being busy

Xintong Song
No worries :)


Thank you~

Xintong Song



(In reply to Paul Lam's message of Mon, Oct 12, 2020 at 2:48 PM.)