Flink 1.11.1 - job manager exits with exit code 0

Alexey Trenikhun
Hello,

I have a Flink 1.11.1 session cluster running via Docker Compose. I upload the job JAR, and when I submit the job, the JobManager exits without any errors in the log:

...
{"@timestamp":"2020-07-25T04:32:54.007Z","@version":"1","message":"Starting execution of job katana-fsp (64ff3943fdc5024c5beef1612518c627) under job master id 00000000000000000000000000000000.","logger_name":"org.apache.flink.runtime.jobmaster.JobMaster","thread_name":"flink-akka.actor.default-dispatcher-18","level":"INFO","level_value":20000}
{"@timestamp":"2020-07-25T04:32:54.011Z","@version":"1","message":"Stopped BLOB server at 0.0.0.0:6124","logger_name":"org.apache.flink.runtime.blob.BlobServer","thread_name":"BlobServer shutdown hook","level":"INFO","level_value":20000}
{"@timestamp":"2020-07-25T04:32:54.015Z","@version":"1","message":"Starting scheduling with scheduling strategy [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]","logger_name":"org.apache.flink.runtime.jobmaster.JobMaster","thread_name":"flink-akka.actor.default-dispatcher-18","level":"INFO","level_value":20000}
{"@timestamp":"2020-07-25T04:32:54.016Z","@version":"1","message":"Job katana-fsp (64ff3943fdc5024c5beef1612518c627) switched from state CREATED to RUNNING.","logger_name":"org.apache.flink.runtime.executiongraph.ExecutionGraph","thread_name":"flink-akka.actor.default-dispatcher-18","level":"INFO","level_value":20000}

Any ideas how to diagnose it? 

Thanks,
Alexey

Re: Flink 1.11.1 - job manager exits with exit code 0

rmetzger0
Hey Alexey,

What is the exit code of the JobManager? Can you check if it has been killed by the OOM killer?
You could also try running the job at DEBUG log level; it might give us an additional indication of why the JVM dies.
What kind of job are you submitting? Is it complicated?

Re: Flink 1.11.1 - job manager exits with exit code 0

rmetzger0
Ah yeah, after sending the email, I saw that the exit code is in the subject line :)

Can you post the entire log? What I find confusing is this log statement: "Stopped BLOB server at 0.0.0.0:6124". The BLOB server is usually only stopped during shutdown. For some reason, the JobManager is in the process of shutting down.
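
The thread name on that log line, "BlobServer shutdown hook", suggests the JVM itself was exiting, since shutdown hooks run when something calls System.exit or the process receives a termination signal. One way to find out who requested the exit is a shutdown hook that inspects the stacks of all threads: on OpenJDK, a thread that called System.exit is still inside java.lang.Runtime.exit while the hooks run. The sketch below is plain Java and not a Flink API; the class and method names are hypothetical, and it assumes you can run code early in the affected JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class ExitCallerDump {

    /** Returns the threads whose stack contains a call to java.lang.Runtime.exit. */
    static List<Thread> findExitCallers(Map<Thread, StackTraceElement[]> stacks) {
        List<Thread> callers = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if ("java.lang.Runtime".equals(frame.getClassName())
                        && "exit".equals(frame.getMethodName())) {
                    callers.add(entry.getKey());
                    break;
                }
            }
        }
        return callers;
    }

    /** Registers a shutdown hook that prints the stack of every thread found inside Runtime.exit. */
    static void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
            for (Thread caller : findExitCallers(stacks)) {
                System.err.println("JVM exit requested by thread: " + caller.getName());
                for (StackTraceElement frame : stacks.get(caller)) {
                    System.err.println("    at " + frame);
                }
            }
        }, "exit-caller-dump"));
    }
}
```

Calling ExitCallerDump.install() as the first statement of the suspected code path would print the offending stack to stderr just before the process goes away. If nothing is printed, the exit was probably triggered from outside the JVM, for example by a signal.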

Re: Flink 1.11.1 - job manager exits with exit code 0

Alexey Trenikhun
Hi Robert,
I found the cause: it was a bug in the job itself. Code after streamEnv.execute(...) called System.exit(0). This went unnoticed before 1.11, but with 1.11 (I guess as in Application Mode) main is called directly from the JobManager, so System.exit(0) exits the whole JVM.
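
The pattern described here, and one way to avoid it, can be sketched without Flink on the classpath. In the sketch, runPipeline is a hypothetical stand-in for building the job and calling streamEnv.execute(...), and the class name and default job name are made up for illustration. The only point is that main returns normally and leaves the decision to terminate the process to whoever launched it.

```java
public final class JobMainSketch {

    /**
     * Hypothetical stand-in for building the pipeline and calling
     * streamEnv.execute(jobName). Returns the job name it was given.
     */
    static String runPipeline(String jobName) {
        // Real job code (sources, transformations, sinks, execute) would go here.
        return jobName;
    }

    public static void main(String[] args) {
        String jobName = args.length > 0 ? args[0] : "example-job";

        // Problematic version:
        //   runPipeline(jobName);
        //   System.exit(0);   // also terminates any JVM that invoked main in-process
        //
        // Safer version: return normally and let failures surface as exceptions.
        String finished = runPipeline(jobName);
        System.out.println("Submitted or finished job: " + finished);
    }
}
```

When main is started as its own process, returning normally still yields exit code 0 once all non-daemon threads have finished, so the explicit System.exit(0) is usually not needed in either setup.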

Thank you, and sorry for the unnecessary noise.
Alexey


Re: Flink 1.11.1 - job manager exits with exit code 0

rmetzger0
Thanks for reporting back. Glad you found the issue. This reminds me of a ticket about this topic some time ago :) https://issues.apache.org/jira/browse/FLINK-15156
