How to debug a job stuck in a deployment/run loop?


Jason Kania
I am attempting to migrate from Flink 1.7.1 to 1.9.1 and have hit a problem where previously working jobs no longer launch after being submitted. In the UI, a submitted job shows as deploying for a period, then enters the running state before returning to the deploying state. This repeats regularly, with the job bouncing between the two states. No exceptions or errors are visible in the logs. There is no data coming in for the job to process, and the Kafka queues are empty.

If I look at the thread activity of the task manager running the job in top, I see that the busiest threads are flink-akka threads, sometimes jumping to very high CPU numbers. That is all I have for info.

Any suggestions on how to debug this? I can set breakpoints and attach a debugger if that helps; I am just not sure at this point where to start.

Thanks,

Jason

Re: How to debug a job stuck in a deployment/run loop?

Arvid Heise-3
Hi Jason,

could you describe your topology? Are you writing to Kafka? Are you using exactly-once semantics? Are you seeing any warnings?
If so, one thing that immediately comes to mind is transaction.max.timeout.ms. If the transaction timeout that Flink sets for the producer (transaction.timeout.ms, 1 hour by default) is higher than the maximum the Kafka brokers allow, the job may run into indefinite restart loops in rare cases.

"Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than its value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode."
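
As a rough illustration of the mismatch described above, the following Java sketch compares the two settings and builds a producer configuration whose transaction.timeout.ms stays within the broker limit. The class name, the helper methods, and the localhost:9092 address are hypothetical; only the two property names and their default values come from the quoted documentation. It uses plain java.util.Properties and does not connect to Kafka or Flink.

```java
import java.util.Properties;

public class TransactionTimeoutSketch {

    // Broker-side default for transaction.max.timeout.ms (15 minutes), per the quoted docs.
    static final long BROKER_DEFAULT_MAX_TIMEOUT_MS = 15 * 60 * 1000L;

    // Default transaction.timeout.ms set by FlinkKafkaProducer011 (1 hour), per the quoted docs.
    static final long FLINK_DEFAULT_TIMEOUT_MS = 60 * 60 * 1000L;

    // Hypothetical helper: true if the producer asks for a longer transaction
    // timeout than the broker is configured to allow.
    static boolean exceedsBrokerLimit(long producerTimeoutMs, long brokerMaxTimeoutMs) {
        return producerTimeoutMs > brokerMaxTimeoutMs;
    }

    // Hypothetical helper: builds producer properties with an explicit
    // transaction.timeout.ms, refusing values above the broker limit.
    static Properties producerProperties(String bootstrapServers,
                                         long producerTimeoutMs,
                                         long brokerMaxTimeoutMs) {
        if (exceedsBrokerLimit(producerTimeoutMs, brokerMaxTimeoutMs)) {
            throw new IllegalArgumentException(
                "transaction.timeout.ms (" + producerTimeoutMs
                + ") exceeds broker transaction.max.timeout.ms (" + brokerMaxTimeoutMs + ")");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("transaction.timeout.ms", Long.toString(producerTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // With both defaults left untouched, the producer timeout is above the broker limit.
        System.out.println(exceedsBrokerLimit(FLINK_DEFAULT_TIMEOUT_MS, BROKER_DEFAULT_MAX_TIMEOUT_MS));

        // Lowering the producer timeout to the broker limit yields a valid configuration.
        Properties props = producerProperties(
            "localhost:9092", BROKER_DEFAULT_MAX_TIMEOUT_MS, BROKER_DEFAULT_MAX_TIMEOUT_MS);
        System.out.println(props.getProperty("transaction.timeout.ms"));
    }
}
```

Properties built this way would be passed as the producer config when constructing the Kafka sink. The alternative, as the quoted documentation suggests, is to raise transaction.max.timeout.ms on the brokers instead of lowering the producer value.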

Best,

Arvid

In reply to Jason Kania's message of Fri, Jan 24, 2020 at 2:47 AM.

Re: How to debug a job stuck in a deployment/run loop?

Till Rohrmann
Hi Jason,

Getting access to the log files would help most in figuring out what is going wrong.

Cheers,
Till

In reply to Arvid Heise's message of Tue, Jan 28, 2020 at 9:08 AM.