Maybe a flink bug. Job keeps in FAILING state

Joshua Fan
Hi All,
Our job has a topology of three operators: source, parser, and persist. Occasionally, 5 subtasks of the source hit an exception and switch to FAILED; at the same time, one subtask of the parser also hits an exception and switches to FAILED. The JobMaster receives the failure message from the parser and then tries to cancel all subtasks. Most subtasks of the three operators switch to CANCELED, except the 5 source subtasks, because their state was already FAILED before the JobMaster tried to cancel them. As a result, the JobMaster cannot reach a final state and the job stays in FAILING, while the source subtasks stay in CANCELING.

The job runs on a Flink 1.7 cluster on YARN, and there is only one TM with 10 slots.

The attached files contain the JM log, the TM log, and a screenshot of the UI.

The exception timestamp is about 2019-06-16 13:42:28.

Yours
Joshua

Attachments:
20190618104945417.jpg (89K)
jobmanager.log.2019-06-16.0 (1004K)
taskmanager.log (10M)

Re: Maybe a flink bug. Job keeps in FAILING state

Chesnay Schepler
@Till, have you seen something like this before? Although all source tasks
reach a terminal state (FAILED) on the TM, the TM does not send updates to
the JM for all of them, but only for a single one.

Re: Maybe a flink bug. Job keeps in FAILING state

Zhijiang(wangzhijiang999)
As long as one task is still in the CANCELING state, the job itself may also stay in a non-final (canceling) state.
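
To illustrate that rule, here is a minimal sketch in Java. It is a toy model and not Flink's actual scheduler code; the names `JobStateSketch`, `TaskState`, and `canReachFinalState` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (not Flink's real classes) of when a job may reach a final state. */
class JobStateSketch {

    enum TaskState {
        RUNNING, CANCELING, CANCELED, FAILED, FINISHED;

        boolean isTerminal() {
            return this == CANCELED || this == FAILED || this == FINISHED;
        }
    }

    /** The job can only reach a final state once every task the JM knows about is terminal. */
    static boolean canReachFinalState(List<TaskState> statesSeenByJobMaster) {
        for (TaskState state : statesSeenByJobMaster) {
            if (!state.isTerminal()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<TaskState> stuck = Arrays.asList(
                TaskState.CANCELED, TaskState.FAILED, TaskState.CANCELING);
        List<TaskState> done = Arrays.asList(
                TaskState.CANCELED, TaskState.FAILED, TaskState.CANCELED);

        System.out.println("stuck: " + canReachFinalState(stuck)); // prints "stuck: false"
        System.out.println("done: " + canReachFinalState(done));   // prints "done: true"
    }
}
```

A single task that never leaves CANCELING is enough to keep the whole job from finishing its failure handling.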

@Joshua, can you confirm that all of the tasks in the topology were already in a terminal state such as FAILED or CANCELED?

Best,
Zhijiang

Re: Maybe a flink bug. Job keeps in FAILING state

Joshua Fan
Hi Zhijiang,

I did not capture the job UI. The topology was in the FAILING state. As can be seen in the picture attached to the first mail, the persistentbolt subtasks were all CANCELED. The parsebolt subtasks, as described before, had one subtask FAILED and the other subtasks CANCELED. Of the source subtasks, one (subtask 4/5) was CANCELED, while the others (subtasks 1/5, 2/5, 3/5, and 5/5) were CANCELING, which is not a terminal state.

The subtask states described above are the JM's view. In the TM's view, all of the source subtasks were FAILED; I do not know why the JM was not notified about this.
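
The mismatch between the two views can be made explicit with a small comparison. This is only a sketch in Java: the class `StateViewDiff` and its methods are hypothetical helpers, not part of Flink, and the state values are typed in by hand from the description above rather than read from a real cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds subtasks whose terminal state never reached the JM. */
class StateViewDiff {

    static boolean isTerminal(String state) {
        return state.equals("FAILED") || state.equals("CANCELED") || state.equals("FINISHED");
    }

    /** Returns subtasks that are terminal on the TM but still non-terminal on the JM. */
    static List<String> unreported(Map<String, String> jmView, Map<String, String> tmView) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : tmView.entrySet()) {
            String jmState = jmView.get(entry.getKey());
            if (isTerminal(entry.getValue()) && jmState != null && !isTerminal(jmState)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> jmView = new LinkedHashMap<>();
        Map<String, String> tmView = new LinkedHashMap<>();
        for (int i = 1; i <= 5; i++) {
            String subtask = "source " + i + "/5";
            // JM view: subtask 4/5 is CANCELED, the rest are stuck in CANCELING.
            jmView.put(subtask, i == 4 ? "CANCELED" : "CANCELING");
            // TM view: every source subtask is FAILED.
            tmView.put(subtask, "FAILED");
        }
        // prints [source 1/5, source 2/5, source 3/5, source 5/5]
        System.out.println(unreported(jmView, tmView));
    }
}
```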

All of the failures were triggered by an OOM: a subtask could not create a native thread during checkpointing. I also dumped the JVM stack, and it shows that the four subtasks (1/5, 2/5, 3/5, and 5/5) were still active after the OOM was thrown and cancellation was requested. I have attached the jstack file to this email.

Yours sincerely
Joshua

Attachment: tm.jstack (240K)

Re: Maybe a flink bug. Job keeps in FAILING state

Zhijiang(wangzhijiang999)
Hi Joshua,

If the tasks (subtasks 1/5, 2/5, 3/5, and 5/5) were really in the CANCELED state on the TM side but in the CANCELING state on the JM side, that might indicate that the terminal-state RPC was not received by the JM. I am not sure whether the OOM could cause this kind of unexpected behavior.

In addition, you mentioned that these tasks were still active after the OOM and after cancellation was requested, so I am not sure at which point in time your attached TM stack was taken. It would help if you could provide the corresponding TM log and JM log.
From the TM log it is easy to check the final state of each task.

Best,
Zhijiang

Re: Maybe a flink bug. Job keeps in FAILING state

Chesnay Schepler
The logs are attached to the initial mail.

Echoing my thoughts from earlier: from the logs it looks as if the TM never even submits the terminal state RPC calls for several tasks to the JM.

Re: Maybe a flink bug. Job keeps in FAILING state

Zhijiang(wangzhijiang999)
Thanks for the reminder, @Chesnay Schepler.

I just looked through the related logs. From the JM's view, none of the five "Source: ServiceLog" tasks is in a terminal state. The relevant sequence is as follows:

1. The checkpoint in the task causes the OOM, which results in a call to `Task#failExternally`; we can see the log "Attempting to fail task externally" in the TM log.
2. The source task transitions from RUNNING to FAILED and then starts a canceler thread to cancel the task; we can see the log "Triggering cancellation of task" in the TM log.
3. When the JM starts to cancel the source tasks, the RPC call `Task#cancelExecution` finds that the task is already in the FAILED state from step 2; we can see the log "Attempting to cancel task" in the TM log.

In the end, according to the JM log, none of the five source tasks reaches a terminal state. My guess is that step 2 did not create the canceler thread successfully: the root failure was an OOM while creating a native thread in step 1, so creating the canceler thread may well have failed too, since the process is unstable after an OOM. If so, the source task would never be interrupted and would therefore never report to the JM, even though its state had already been changed to FAILED.
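
The suspected sequence can be reproduced in a toy simulation. The sketch below is written in Java and is not Flink's `Task` class: `CancelerSketch`, its fields, and `simulate` are hypothetical names, the OOM is injected through a `ThreadFactory` instead of coming from the JVM, and swallowing the error is an assumption made only to keep the demo running.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadFactory;

/** Toy model of the suspected sequence; this is NOT Flink's Task class. */
class CancelerSketch {

    /** Terminal states that the simulated JM has been told about. */
    final List<String> reportedToJobMaster = new CopyOnWriteArrayList<>();

    /** Task state as seen locally on the simulated TM. */
    volatile String localState = "RUNNING";

    /** Stands in for the task thread: it reports only after being interrupted. */
    final Thread taskThread;

    CancelerSketch() {
        taskThread = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // placeholder for the task's real work
            } catch (InterruptedException e) {
                // interrupted by the canceler: fall through and report
            }
            reportedToJobMaster.add(localState);
        });
        taskThread.setDaemon(true); // do not keep the JVM alive if never interrupted
        taskThread.start();
    }

    /** Mirrors step 2: switch to FAILED first, then start a canceler thread. */
    void failExternally(ThreadFactory threads) {
        localState = "FAILED";
        // If thread creation throws here, nobody ever interrupts the task thread.
        Thread canceler = threads.newThread(taskThread::interrupt);
        canceler.start();
    }

    /** Runs one scenario and waits briefly for the task thread to finish. */
    static CancelerSketch simulate(ThreadFactory threads) {
        CancelerSketch task = new CancelerSketch();
        try {
            task.failExternally(threads);
        } catch (OutOfMemoryError e) {
            // Assumption for the demo: the error is lost and nothing retries.
        }
        try {
            task.taskThread.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return task;
    }

    public static void main(String[] args) {
        CancelerSketch healthy = simulate(Thread::new);
        CancelerSketch starved = simulate(r -> {
            throw new OutOfMemoryError("unable to create new native thread");
        });
        System.out.println("healthy: local=" + healthy.localState
                + " reported=" + healthy.reportedToJobMaster);
        System.out.println("starved: local=" + starved.localState
                + " reported=" + starved.reportedToJobMaster);
    }
}
```

In the starved scenario the local state is FAILED but the list of reported states stays empty, which matches the picture of a TM that considers the task failed while the JM keeps waiting.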

The tasks of the other vertices do not trigger `Task#failExternally` in step 1; they only receive the cancel RPC from the JM in step 3. I guess that by then, later than the source failures, the canceler thread could be created successfully after some GCs, so these tasks were canceled and reported to the JM.

I think the key problem is that some behavior is not as expected after an OOM, which can cause problems like this one. Maybe we should handle OOM errors in a more drastic way, such as making the TM exit, to avoid this potential issue.
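
As a rough idea of what "make the TM exit on OOM" could look like, here is a minimal Java sketch. It is not Flink's implementation; `OomExitSketch`, `isOutOfMemory`, and `exitIfOutOfMemory` are hypothetical names, and the exit action is passed in so the demo can run without killing the JVM.

```java
/** Sketch of treating any OutOfMemoryError as fatal; not Flink's implementation. */
class OomExitSketch {

    /** Walks the cause chain looking for an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 64) { // bound the walk in case of cyclic causes
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /** Runs the exit action if the failure is, or wraps, an OutOfMemoryError. */
    static boolean exitIfOutOfMemory(Throwable failure, Runnable exitAction) {
        if (isOutOfMemory(failure)) {
            exitAction.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable failure = new RuntimeException("checkpoint failed",
                new OutOfMemoryError("unable to create new native thread"));

        // A real process might pass () -> Runtime.getRuntime().halt(-1) instead.
        // halt() skips shutdown hooks, which could themselves need new threads.
        boolean fatal = exitIfOutOfMemory(failure,
                () -> System.out.println("would halt the JVM now"));
        System.out.println("treated as fatal: " + fatal);
    }
}
```

Exiting the process pushes recovery onto the cluster framework, which restarts the container, instead of relying on a JVM that may no longer be able to create threads.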

Best,
Zhijiang

Re: Maybe a flink bug. Job keeps in FAILING state

Joshua Fan
Hi Zhijiang

Thank you for your analysis; I agree with it. The solution may be, as you mentioned, to let the TM exit when any type of OOM occurs, because Flink has no control over a TM once an OOM occurs.


I don't know whether it is worth fixing.

Thank you all.

Yours sincerely
Joshua

Re: Maybe a flink bug. Job keeps in FAILING state

Zhijiang(wangzhijiang999)
Thanks for opening this ticket; I will watch it.

Flink does not handle OOM errors in any special way. I remember that we discussed a similar issue before, but I have forgotten the conclusion, or whether there were other concerns about it.
I am not sure whether it is worth fixing at the moment; maybe Till or Chesnay could make the final decision.

Best,
Zhijiang
------------------------------------------------------------------
From:Joshua Fan <[hidden email]>
Send Time:2019年6月25日(星期二) 11:10
To:zhijiang <[hidden email]>
Cc:Chesnay Schepler <[hidden email]>; user <[hidden email]>; Till Rohrmann <[hidden email]>
Subject:Re: Maybe a flink bug. Job keeps in FAILING state

Hi Zhijiang

Thank you for your analysis. I agree with it. The solution may be to let tm exit like you mentioned when any type of oom occurs, because the flink has no control on a tm when a oom occurs.


Don't know it is worth to fix.

Thank you all.

Yours sincerely
Joshua

On Fri, Jun 21, 2019 at 5:32 PM zhijiang <[hidden email]> wrote:
Thanks for the reminding @Chesnay Schepler .

I just looked throught the related logs. Actually all the five "Source: ServiceLog" tasks are not in terminal state on JM view, the relevant processes are as follows:

1. The checkpoint in task causes OOM issue which would call `Task#failExternally` as a result, we could see the log "Attempting to fail task externally" in tm.
2. The source task would transform state from RUNNING to FAILED and then starts a canceler thread for canceling task, we could see log "Triggering cancellation of task" in tm.
3. When JM starts to cancel the source tasks, the rpc call `Task#cancelExecution` would find the task was already in FAILED state as above step 2, we could see log "Attempting to cancel task" in tm.

At last all the five source tasks are not in terminal states from jm log, I guess the step 2 might not create canceler thread successfully, because the root failover was caused by OOM during creating native thread in step1, so it might exist possibilities that createing canceler thread is not successful as well in OOM case which is unstable. If so, the source task would not been interrupted at all, then it would not report to JM as well, but the state is already changed to FAILED before. 

For the other vertex tasks, it does not trigger `Task#failExternally` in step 1, and only receives the cancel rpc from JM in step 3. And I guess at this time later than the source period, the canceler thread could be created succesfully after some GCs, then these tasks could be canceled as reported to JM side.

I think the key problem is under OOM case some behaviors are not within expectations, so it might bring problems. Maybe we should handle OOM error in extreme way like making TM exit to solve the potential issue.

Best,
Zhijiang
------------------------------------------------------------------
From:Chesnay Schepler <[hidden email]>
Send Time:2019年6月21日(星期五) 16:34
To:zhijiang <[hidden email]>; Joshua Fan <[hidden email]>
Cc:user <[hidden email]>; Till Rohrmann <[hidden email]>
Subject:Re: Maybe a flink bug. Job keeps in FAILING state

The logs are attached to the initial mail.

Echoing my thoughts from earlier: from the logs it looks as if the TM never even submits the terminal state RPC calls for several tasks to the JM.

On 21/06/2019 10:30, zhijiang wrote:
Hi Joshua,

If the tasks(subtask 1/5,subtask 2/5,subtask 3/5,subtask 5/5) were really in CANCELED state on TM side, but in CANCELING state on JM side, then it might indicates the terminal state RPC was not received by JM. I am not sure whether the OOM would cause this issue happen resulting in unexpected behavior.

In addition, you mentioned these tasks are still active after OOM and was called to cancel, so I am not sure what is the specific periods for your attached TM stack. I think it might provide help if you could provide corresponding TM log and JM log. 
From TM log it is easy to check the task final state. 

Best,
Zhijiang
------------------------------------------------------------------
From:Joshua Fan [hidden email]
Send Time:2019年6月20日(星期四) 11:55
To:zhijiang [hidden email]
Cc:user [hidden email]; Till Rohrmann [hidden email]; Chesnay Schepler [hidden email]
Subject:Re: Maybe a flink bug. Job keeps in FAILING state

zhijiang

I did not capture the job UI. The topology is in the FAILING state. The persistentbolt subtasks, as can be seen in the picture attached to the first mail, were all CANCELED. The parsebolt subtasks, as described before, had one subtask FAILED and the others CANCELED. The source subtasks had one subtask (subtask 4/5) CANCELED and the others (subtasks 1/5, 2/5, 3/5 and 5/5) CANCELING, which is not a terminal state.

The subtask states described above are the JM's view. In the TM's view, all of the source subtasks were FAILED; I do not know why the JM was not notified about this.

All of the failures were triggered by an OOM: a subtask could not create a native thread while checkpointing. I also dumped the JVM stacks, which show that the four subtasks (subtasks 1/5, 2/5, 3/5 and 5/5) were still active after the OOM was thrown and after they were asked to cancel. I attached the jstack file to this email.

Yours sincerely
Joshua

On Wed, Jun 19, 2019 at 4:40 PM zhijiang <[hidden email]> wrote:
As long as one task is still in the CANCELING state, the job status may also remain in a canceling state.

@Joshua Can you confirm that all of the tasks in the topology were already in a terminal state, such as FAILED or CANCELED?
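
To illustrate the point, here is a minimal Java sketch. It is not Flink's actual code: `JobStateSketch`, `TaskState`, `allTerminal` and `jobStatus` are hypothetical names, and a real job may restart instead of ending in FAILED, depending on its restart strategy. It only shows that a single subtask stuck in CANCELING keeps the job from leaving FAILING.

```java
import java.util.Arrays;
import java.util.List;

class JobStateSketch {
    // Simplified, hypothetical mirror of the task states named in this thread.
    enum TaskState {
        RUNNING, CANCELING, CANCELED, FAILED, FINISHED;

        boolean isTerminal() {
            return this == CANCELED || this == FAILED || this == FINISHED;
        }
    }

    // True only when every subtask has reported a terminal state.
    static boolean allTerminal(List<TaskState> subtasks) {
        for (TaskState state : subtasks) {
            if (!state.isTerminal()) {
                return false;
            }
        }
        return true;
    }

    // Assumption: a failing job waits in FAILING until all subtasks are terminal.
    static String jobStatus(List<TaskState> subtasks) {
        return allTerminal(subtasks) ? "FAILED" : "FAILING";
    }

    public static void main(String[] args) {
        // JM view of the source subtasks as reported in this thread:
        // subtask 4/5 CANCELED, the other four still CANCELING.
        List<TaskState> jmView = Arrays.asList(
                TaskState.CANCELING, TaskState.CANCELING, TaskState.CANCELING,
                TaskState.CANCELED, TaskState.CANCELING);
        System.out.println(jobStatus(jmView)); // prints FAILING
    }
}
```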

Best,
Zhijiang
------------------------------------------------------------------
From:Chesnay Schepler <[hidden email]>
Send Time: Wednesday, June 19, 2019, 16:32
To:Joshua Fan <[hidden email]>; user <[hidden email]>; Till Rohrmann <[hidden email]>
Subject:Re: Maybe a flink bug. Job keeps in FAILING state

@Till have you seen something like this before? Despite all source tasks
reaching a terminal state on a TM (FAILED) it does not send updates to 
the JM for all of them, but only a single one.

On 18/06/2019 12:14, Joshua Fan wrote:
> Hi All,
> There is a topology of 3 operators: source, parser, and
> persist. Occasionally, 5 subtasks of the source encounter an exception
> and turn to FAILED; at the same time, one subtask of the parser runs
> into an exception and turns to FAILED too. The jobmaster gets a message
> about the parser's failure. The jobmaster then tries to cancel all the
> subtasks; most of the subtasks of the three operators turn to CANCELED
> except the 5 subtasks of the source, because the state of those 5
> is already FAILED before the jobmaster tries to cancel them. Then the
> jobmaster cannot reach a final state but stays in the FAILING state
> while the subtasks of the source stay in the CANCELING state.
>
> The job runs on a Flink 1.7 cluster on YARN, and there is only one TM
> with 10 slots.
>
> The attached files contain a JM log, a TM log and the UI picture.
>
> The exception timestamp is about 2019-06-16 13:42:28.
>
> Yours
> Joshua






Re: Maybe a flink bug. Job keeps in FAILING state

Till Rohrmann
Thanks for reporting this problem Joshua. I think this is actually a problem we should fix. The cause seems to be that we swallow the OOM exception when calling `Task#failExternally`. Probably we don't set the right uncaught exception handler in the thread which executes the checkpoint. Let's continue our discussion on the JIRA issue.
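
A minimal sketch of the idea, not the actual Flink fix: threads are created with an uncaught exception handler that escalates an `OutOfMemoryError` instead of letting it vanish. `FatalErrorHandlerSketch`, `fatalOnOom` and `isOutOfMemory` are hypothetical names, and the fatal action is passed in as a `Runnable` so the sketch can run without terminating the JVM. In a real process that action could be something like `Runtime.getRuntime().halt(-1)`.

```java
import java.util.concurrent.ThreadFactory;

class FatalErrorHandlerSketch {

    // Returns true if the throwable or one of its causes is an OutOfMemoryError.
    // The depth bound guards against unusually long or cyclic cause chains.
    static boolean isOutOfMemory(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (current instanceof OutOfMemoryError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    // Builds threads whose uncaught OOM runs `onFatal` instead of being lost.
    static ThreadFactory fatalOnOom(String threadName, Runnable onFatal) {
        return runnable -> {
            Thread thread = new Thread(runnable, threadName);
            thread.setUncaughtExceptionHandler((t, error) -> {
                if (isOutOfMemory(error)) {
                    onFatal.run();
                } else {
                    System.err.println("Uncaught error in " + t.getName() + ": " + error);
                }
            });
            return thread;
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // The fatal action only prints here so the sketch is safe to run.
        ThreadFactory factory = fatalOnOom("checkpoint-sketch",
                () -> System.out.println("fatal OOM: would terminate the process here"));
        Thread thread = factory.newThread(() -> {
            throw new OutOfMemoryError("simulated");
        });
        thread.start();
        thread.join();
    }
}
```

Note that such a handler only fires when the error actually escapes the thread's `run` method. Work passed to `ExecutorService#submit` has its throwable captured in the returned `Future`, so the handler would not see it there, whereas `execute` lets it propagate.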

Thanks Zhijiang for analysing the problem.

Cheers,
Till

On Tue, Jun 25, 2019 at 5:21 AM zhijiang <[hidden email]> wrote:
Thanks for opening this ticket; I will watch it.

Flink does not handle OOM errors specially. I remember we discussed a similar issue before, but I forgot the conclusion, or there may have been other concerns about it.
I am not sure whether it is worth fixing at the moment; maybe Till or Chesnay could make the final decision.

Best,
Zhijiang
------------------------------------------------------------------
From:Joshua Fan <[hidden email]>
Send Time: Tuesday, June 25, 2019, 11:10
To:zhijiang <[hidden email]>
Cc:Chesnay Schepler <[hidden email]>; user <[hidden email]>; Till Rohrmann <[hidden email]>
Subject:Re: Maybe a flink bug. Job keeps in FAILING state

Hi Zhijiang

Thank you for your analysis. I agree with it. The solution may be to let the TM exit, as you mentioned, when any type of OOM occurs, because Flink has no control over a TM once an OOM occurs.

I don't know whether it is worth fixing.

Thank you all.

Yours sincerely
Joshua

On Fri, Jun 21, 2019 at 5:32 PM zhijiang <[hidden email]> wrote:
Thanks for the reminder, @Chesnay Schepler.

I just looked through the related logs. Actually, none of the five "Source: ServiceLog" tasks is in a terminal state in the JM's view. The relevant sequence is as follows:

1. The checkpoint in the task causes an OOM, which results in a call to `Task#failExternally`; we can see the log "Attempting to fail task externally" in the TM.
2. The source task transitions from RUNNING to FAILED and then starts a canceler thread to cancel the task; we can see the log "Triggering cancellation of task" in the TM.
3. When the JM starts to cancel the source tasks, the RPC call `Task#cancelExecution` finds the task already in the FAILED state from step 2; we can see the log "Attempting to cancel task" in the TM.

In the end, none of the five source tasks is in a terminal state according to the JM log. My guess is that step 2 did not create the canceler thread successfully: the root failure was an OOM while creating a native thread in step 1, so creating the canceler thread may well fail too while the process is in that unstable state. If so, the source task would never be interrupted and would therefore never report to the JM, even though its state had already changed to FAILED.

For the tasks of the other vertices, `Task#failExternally` is not triggered in step 1; they only receive the cancel RPC from the JM in step 3. I guess that by then, later than the source failure, the canceler thread could be created successfully after some GCs, so these tasks could be canceled and reported to the JM.
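
The suspected sequence can be mimicked with a small Java simulation. This is a sketch under stated assumptions, not Flink's `Task` class: `CancelerThreadSketch`, `SketchTask` and `simulate` are hypothetical names, the OOM is thrown artificially by the thread factory, and for brevity the canceler thread reports the final state itself, whereas in the analysis above the task reports after being interrupted.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadFactory;

class CancelerThreadSketch {
    enum State { RUNNING, FAILED, CANCELED }

    // Hypothetical stand-in for a task on the TM; this is not Flink's Task class.
    static class SketchTask {
        final String name;
        final ThreadFactory threads;
        final List<String> reportedToJm;
        volatile State state = State.RUNNING;

        SketchTask(String name, ThreadFactory threads, List<String> reportedToJm) {
            this.name = name;
            this.threads = threads;
            this.reportedToJm = reportedToJm;
        }

        // Steps 1 and 2: the local state flips first, then a canceler thread is started.
        void failExternally() {
            state = State.FAILED;
            // Simplification: the canceler reports the final state directly.
            Thread canceler = threads.newThread(() -> reportedToJm.add(name + " -> " + state));
            canceler.start();
            try {
                canceler.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        // Step 3: the cancel RPC finds the task no longer RUNNING and does nothing.
        void cancelExecution() {
            if (state == State.RUNNING) {
                state = State.CANCELED;
                reportedToJm.add(name + " -> " + state);
            }
        }
    }

    // Runs steps 1 to 3 for one source task and returns what the JM was told.
    static List<String> simulate(boolean threadCreationFails) {
        List<String> reportedToJm = new CopyOnWriteArrayList<>();
        ThreadFactory factory;
        if (threadCreationFails) {
            factory = r -> { throw new OutOfMemoryError("unable to create new native thread"); };
        } else {
            factory = Thread::new;
        }
        SketchTask source = new SketchTask("source", factory, reportedToJm);
        try {
            source.failExternally();
        } catch (OutOfMemoryError swallowed) {
            // Assumption: the caller swallows the error, so nobody notices it.
        }
        source.cancelExecution();
        return reportedToJm;
    }

    public static void main(String[] args) {
        System.out.println("healthy: " + simulate(false)); // prints healthy: [source -> FAILED]
        System.out.println("OOM:     " + simulate(true));  // prints OOM:     []
    }
}
```

In the OOM run the task is FAILED locally, the later cancel call is a no-op, and nothing is ever reported, which matches a task that appears stuck in CANCELING from the JM's side.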

I think the key problem is that under OOM some behavior is not as expected, which can cause problems like this one. Maybe we should handle OOM errors drastically, for example by making the TM exit, to avoid such issues.

Best,
Zhijiang