TM occasionally hang in deploying state in Flink 1.5

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

TM occasionally hang in deploying state in Flink 1.5

qi luo
Hi all,

We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we observed some stuck jobs, due to some TM hang in “DEPLOYING” state. 

On checking TM log, it shows that it stuck in downloading jars in BlobClient:

————
...
INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
INFO  org.apache.flink.runtime.blob.BlobClient                          - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port

no more logs...
————

It seems that the TM is calling BlobClient to download jars from JM/BlobServer. Under hood it’s calling Socket.connect() and then Socket.read() to retrieve results. 

Should we add timeout in socket operations in BlobClient to resolve this issue?

Thanks,
Qi
Reply | Threaded
Open this post in threaded view
|

Re: TM occasionally hang in deploying state in Flink 1.5

qi luo
Hello,

This issue occurred again and we dumped the TM thread. It indeed hung on socket read to download jar from Blob server:

"DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
        at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
        at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
        at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        - locked <0x000000078ab60ba8> (a java.lang.Object)
        at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
        at java.lang.Thread.run(Thread.java:748)

I checked the latest master code. There’s still no socket timeout in Blob client. Should I create an issue to add this timeout?

Regards,
Qi 

On Apr 19, 2019, at 7:49 PM, qi luo <[hidden email]> wrote:

Hi all,

We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we observed some stuck jobs, due to some TM hang in “DEPLOYING” state. 

On checking TM log, it shows that it stuck in downloading jars in BlobClient:

————
...
INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
INFO  org.apache.flink.runtime.blob.BlobClient                          - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port

no more logs...
————

It seems that the TM is calling BlobClient to download jars from JM/BlobServer. Under hood it’s calling Socket.connect() and then Socket.read() to retrieve results. 

Should we add timeout in socket operations in BlobClient to resolve this issue?

Thanks,
Qi

Reply | Threaded
Open this post in threaded view
|

Re: TM occasionally hang in deploying state in Flink 1.5

Dawid Wysakowicz-2

Hi,

Feel free to open a JIRA for this issue. By the way have you investigated what is the root cause for it hanging?

Best,

Dawid

On 25/04/2019 08:55, qi luo wrote:
Hello,

This issue occurred again and we dumped the TM thread. It indeed hung on socket read to download jar from Blob server:

"DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
        at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
        at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
        at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        - locked <0x000000078ab60ba8> (a java.lang.Object)
        at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
        at java.lang.Thread.run(Thread.java:748)

I checked the latest master code. There’s still no socket timeout in Blob client. Should I create an issue to add this timeout?

Regards,
Qi 

On Apr 19, 2019, at 7:49 PM, qi luo <[hidden email]> wrote:

Hi all,

We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we observed some stuck jobs, due to some TM hang in “DEPLOYING” state. 

On checking TM log, it shows that it stuck in downloading jars in BlobClient:

————
...
INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
INFO  org.apache.flink.runtime.blob.BlobClient                          - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port

no more logs...
————

It seems that the TM is calling BlobClient to download jars from JM/BlobServer. Under hood it’s calling Socket.connect() and then Socket.read() to retrieve results. 

Should we add timeout in socket operations in BlobClient to resolve this issue?

Thanks,
Qi


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: TM occasionally hang in deploying state in Flink 1.5

qi luo
Thanks Dawid, I’ve created an issue for this https://jira.apache.org/jira/browse/FLINK-12426. Though we’re using 1.5 but this may affect later versions.

I’m still investigating the root case but no result yet. This happens occasionally and isn't easy to reproduce.

On Apr 25, 2019, at 6:40 PM, Dawid Wysakowicz <[hidden email]> wrote:

Hi,

Feel free to open a JIRA for this issue. By the way have you investigated what is the root cause for it hanging?

Best,

Dawid

On 25/04/2019 08:55, qi luo wrote:
Hello,

This issue occurred again and we dumped the TM thread. It indeed hung on socket read to download jar from Blob server:

"DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
        at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
        at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
        at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        - locked <0x000000078ab60ba8> (a java.lang.Object)
        at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
        at java.lang.Thread.run(Thread.java:748)

I checked the latest master code. There’s still no socket timeout in Blob client. Should I create an issue to add this timeout?

Regards,
Qi 

On Apr 19, 2019, at 7:49 PM, qi luo <[hidden email]> wrote:

Hi all,

We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we observed some stuck jobs, due to some TM hang in “DEPLOYING” state. 

On checking TM log, it shows that it stuck in downloading jars in BlobClient:

————
...
INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
INFO  org.apache.flink.runtime.blob.BlobClient                          - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port

no more logs...
————

It seems that the TM is calling BlobClient to download jars from JM/BlobServer. Under hood it’s calling Socket.connect() and then Socket.read() to retrieve results. 

Should we add timeout in socket operations in BlobClient to resolve this issue?

Thanks,
Qi