Flink Job sometimes does not stop as expected on cancelling.

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink Job sometimes does not stop as expected on cancelling.

Mahantesh Patil
Hi Team,

We have Flink jobs running in cluster mode. When I cancel the job and check for status it still shows as running.

Below is logs generated.  I do not see any useful information. Could you guys point me in a right direction to debug this , our flink jobs do heavy processing , my understanding is on cancelling flink waits till operator completes processing to store save point. So is it happening because one of the operator is hung waiting for resource and is there a way to configure timeout.


Appreciate your help.



10 Jan 2019 20:36:51,490 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: control -> control-filter (1/1) (c6f50328bb09ce1ad389e5ea614fc8a9).
10 Jan 2019 20:36:51,490 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab).
10 Jan 2019 20:36:51,490 INFO org.apache.flink.runtime.taskmanager.Task - interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab).
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1).
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1).
10 Jan 2019 20:36:51,492 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed).
10 Jan 2019 20:36:51,492 INFO org.apache.flink.runtime.taskmanager.Task - interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,628 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed).
10 Jan 2019 20:36:51,680 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279).
10 Jan 2019 20:36:51,680 INFO org.apache.flink.runtime.taskmanager.Task - agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,680 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279).
10 Jan 2019 20:36:51,811 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Could not shut down timer service
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
10 Jan 2019 20:36:51,811 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Could not shut down timer service
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
10 Jan 2019 20:36:51,813 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e).
10 Jan 2019 20:37:07,156 INFO org.apache.flink.runtime.taskmanager.Task - subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e) switched from RUNNING to CANCELING.
10 Jan 2019 20:37:07,157 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e).
10 Jan 2019 20:37:07,164 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e).
10 Jan 2019 20:37:07,164 INFO org.apache.flink.runtime.taskmanager.Task - filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e) switched from RUNNING to CANCELING.
10 Jan 2019 20:37:07,164 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task interaction-details (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task current-longests (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task interactions-count (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task agents-working-on-interactions (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task subscription-management (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task filter-business-metrics -> Sink: data_feed (1/1) is already in state CANCELING





Regards,
Mahantesh Patil.
Reply | Threaded
Open this post in threaded view
|

Re: Flink Job sometimes does not stop as expected on cancelling.

Mahantesh Patil
Hi Team,

Following up to check if any of you have faced this issue.  If so how to resolve this.

Regards,
Mahantesh Patil.

On Mon, Feb 11, 2019 at 3:50 PM Mahantesh Patil <[hidden email]> wrote:
Hi Team,

We have Flink jobs running in cluster mode. When I cancel the job and check for status it still shows as running.

Below is logs generated.  I do not see any useful information. Could you guys point me in a right direction to debug this , our flink jobs do heavy processing , my understanding is on cancelling flink waits till operator completes processing to store save point. So is it happening because one of the operator is hung waiting for resource and is there a way to configure timeout.


Appreciate your help.



10 Jan 2019 20:36:51,490 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: control -> control-filter (1/1) (c6f50328bb09ce1ad389e5ea614fc8a9).
10 Jan 2019 20:36:51,490 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab).
10 Jan 2019 20:36:51,490 INFO org.apache.flink.runtime.taskmanager.Task - interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab).
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1).
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,491 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1).
10 Jan 2019 20:36:51,492 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed).
10 Jan 2019 20:36:51,492 INFO org.apache.flink.runtime.taskmanager.Task - interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,628 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed).
10 Jan 2019 20:36:51,680 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279).
10 Jan 2019 20:36:51,680 INFO org.apache.flink.runtime.taskmanager.Task - agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279) switched from RUNNING to CANCELING.
10 Jan 2019 20:36:51,680 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279).
10 Jan 2019 20:36:51,811 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Could not shut down timer service
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
10 Jan 2019 20:36:51,811 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Could not shut down timer service
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
10 Jan 2019 20:36:51,813 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e).
10 Jan 2019 20:37:07,156 INFO org.apache.flink.runtime.taskmanager.Task - subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e) switched from RUNNING to CANCELING.
10 Jan 2019 20:37:07,157 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e).
10 Jan 2019 20:37:07,164 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e).
10 Jan 2019 20:37:07,164 INFO org.apache.flink.runtime.taskmanager.Task - filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e) switched from RUNNING to CANCELING.
10 Jan 2019 20:37:07,164 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interaction-details (1/1) (dc946a863e34f66a95df3a8e8d540cab).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task interaction-details (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task current-longests (1/1) (2f1cee63ab479a8e9e492c990311ccc1).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task current-longests (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task interactions-count (1/1) (aff734ce398978cdc3ac40a1e530d0ed).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task interactions-count (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task agents-working-on-interactions (1/1) (15aa5e81a66a46127ab85719b9df7279).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task agents-working-on-interactions (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task subscription-management (1/1) (f5e529744e645e6acf00d2c5aab2006e).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task subscription-management (1/1) is already in state CANCELING
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task filter-business-metrics -> Sink: data_feed (1/1) (6c229624ea4735215fe98f5458c58c8e).
10 Jan 2019 20:37:07,182 INFO org.apache.flink.runtime.taskmanager.Task - Task filter-business-metrics -> Sink: data_feed (1/1) is already in state CANCELING





Regards,
Mahantesh Patil.