JobManager refusing connections when running many jobs in parallel?

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

JobManager refusing connections when running many jobs in parallel?

Hailu, Andreas

Hi team,

 

We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due with Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.

 

I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:

 

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface  (http://d43723-563.dc.gs.com:41268) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)

... 1 more

Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 16 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)

... 6 more

Caused by: java.net.ConnectException: Connection refused

 

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface  (http://d43723-484.dc.gs.com:36757) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.

at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)

at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)

at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

... 3 more

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 4 more

... (this pattern repeats for number of unique JobIDs)

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

...

26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5

com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)

Caused by: java.net.ConnectException: Connection refused

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at java.net.Socket.connect(Socket.java:538)

at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

 

These connection refusal exceptions and their transient nature makes me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?

 

____________

 

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.

 




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices
Reply | Threaded
Open this post in threaded view
|

Re: JobManager refusing connections when running many jobs in parallel?

rmetzger0
Hi Andreas,

Thanks for reaching out .. this should not happen ...
Maybe your operating system has configured low limits for the number of concurrent connections / sockets. Maybe this thread is helpful: https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections (there might better SO threads, I didn't put much effort into searching :) )

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <[hidden email]> wrote:

Hi team,

 

We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due with Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.

 

I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:

 

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface  (http://d43723-563.dc.gs.com:41268) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)

... 1 more

Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 16 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)

... 6 more

Caused by: java.net.ConnectException: Connection refused

 

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface  (http://d43723-484.dc.gs.com:36757) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.

at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)

at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)

at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

... 3 more

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 4 more

... (this pattern repeats for number of unique JobIDs)

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

...

26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5

com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)

Caused by: java.net.ConnectException: Connection refused

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at java.net.Socket.connect(Socket.java:538)

at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

 

These connection refusal exceptions and their transient nature makes me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?

 

____________

 

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.

 




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices
Reply | Threaded
Open this post in threaded view
|

Re: JobManager refusing connections when running many jobs in parallel?

Hailu, Andreas
Thanks for pointing this out. We had a look - the nodes in our cluster have a cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t believe this is the problem.

The connection refused error makes us think it’s some process using a thread pool for the JobManager hitting capacity on a port somewhere. This sound correct? Is there a config for us to increase the pool size?

From: Robert Metzger <[hidden email]>
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: [hidden email]; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?
 
Hi Andreas,

Thanks for reaching out .. this should not happen ...
Maybe your operating system has configured low limits for the number of concurrent connections / sockets. Maybe this thread is helpful: https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections (there might better SO threads, I didn't put much effort into searching :) )

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <[hidden email]> wrote:

Hi team,

 

We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due with Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.

 

I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:

 

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface  (http://d43723-563.dc.gs.com:41268) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)

... 1 more

Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 16 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)

... 6 more

Caused by: java.net.ConnectException: Connection refused

 

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface  (http://d43723-484.dc.gs.com:36757) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.

at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)

at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)

at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

... 3 more

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 4 more

... (this pattern repeats for number of unique JobIDs)

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

...

26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5

com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)

Caused by: java.net.ConnectException: Connection refused

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at java.net.Socket.connect(Socket.java:538)

at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

 

These connection refusal exceptions and their transient nature makes me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?

 

____________

 

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.

 




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices



Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices
Reply | Threaded
Open this post in threaded view
|

Re: JobManager refusing connections when running many jobs in parallel?

rmetzger0
Thanks for checking.

Your analysis sounds correct. The JM is busy processing job submissions, resulting in other submissions not being accepted.

Increasing rest.connection-timeout should resolve your problem.


On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <[hidden email]> wrote:
Thanks for pointing this out. We had a look - the nodes in our cluster have a cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t believe this is the problem.

The connection refused error makes us think it’s some process using a thread pool for the JobManager hitting capacity on a port somewhere. This sound correct? Is there a config for us to increase the pool size?

From: Robert Metzger <[hidden email]>
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: [hidden email]; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?
 
Hi Andreas,

Thanks for reaching out .. this should not happen ...
Maybe your operating system has configured low limits for the number of concurrent connections / sockets. Maybe this thread is helpful: https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections (there might better SO threads, I didn't put much effort into searching :) )

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <[hidden email]> wrote:

Hi team,

 

We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due with Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.

 

I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:

 

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface  (http://d43723-563.dc.gs.com:41268) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)

... 1 more

Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 16 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)

... 6 more

Caused by: java.net.ConnectException: Connection refused

 

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface  (http://d43723-484.dc.gs.com:36757) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.

at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)

at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)

at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

... 3 more

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 4 more

... (this pattern repeats for number of unique JobIDs)

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

...

26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5

com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)

Caused by: java.net.ConnectException: Connection refused

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at java.net.Socket.connect(Socket.java:538)

at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

 

These connection refusal exceptions and their transient nature makes me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?

 

____________

 

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.

 




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices



Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices
Reply | Threaded
Open this post in threaded view
|

RE: JobManager refusing connections when running many jobs in parallel?

Hailu, Andreas

Interesting – what is the JobManager submission bounded by? Does it only allow a certain number of submissions per second, or is there a number of threads it accepts?

 

// ah

 

From: Robert Metzger <[hidden email]>
Sent: Tuesday, August 11, 2020 4:46 AM
To: Hailu, Andreas [Engineering] <[hidden email]>
Cc: [hidden email]; Shah, Siddharth [Engineering] <[hidden email]>
Subject: Re: JobManager refusing connections when running many jobs in parallel?

 

Thanks for checking.

 

Your analysis sounds correct. The JM is busy processing job submissions, resulting in other submissions not being accepted.

 

Increasing rest.connection-timeout should resolve your problem.

 

 

On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <[hidden email]> wrote:

Thanks for pointing this out. We had a look - the nodes in our cluster have a cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t believe this is the problem.

The connection refused error makes us think it’s some process using a thread pool for the JobManager hitting capacity on a port somewhere. This sound correct? Is there a config for us to increase the pool size?


From: Robert Metzger <[hidden email]>
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: [hidden email]; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?

 

Hi Andreas,

 

Thanks for reaching out .. this should not happen ...

Maybe your operating system has configured low limits for the number of concurrent connections / sockets. Maybe this thread is helpful: https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections (there might better SO threads, I didn't put much effort into searching :) )

 

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <[hidden email]> wrote:

Hi team,

 

We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due with Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.

 

I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:

 

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface  (http://d43723-563.dc.gs.com:41268) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)

... 1 more

Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 16 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)

... 6 more

Caused by: java.net.ConnectException: Connection refused

 

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface  (http://d43723-484.dc.gs.com:36757) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.

at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)

at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)

at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

... 3 more

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 4 more

... (this pattern repeats for number of unique JobIDs)

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

...

26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5

com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)

Caused by: java.net.ConnectException: Connection refused

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at java.net.Socket.connect(Socket.java:538)

at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

 

These connection refusal exceptions and their transient nature makes me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?

 

____________

 

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.

 

 



Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices

 



Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices
Reply | Threaded
Open this post in threaded view
|

RE: JobManager refusing connections when running many jobs in parallel?

Hailu, Andreas
In reply to this post by rmetzger0

Hi Robert, following up - I suppose the questions distills into how would tuning  a timeout resolve connection refusals? I would think timeout-related failures may go down if the network is hammered. Connection Refused sounds like we’re out of threads or sockets somewhere, no? We’re testing out an increase in our sockets’ max connections, but I would like to know your thoughts.

 

// ah

 

From: Hailu, Andreas [Engineering]
Sent: Monday, August 17, 2020 9:51 AM
To: 'Robert Metzger' <[hidden email]>
Cc: [hidden email]; Shah, Siddharth [Engineering] <[hidden email]>
Subject: RE: JobManager refusing connections when running many jobs in parallel?

 

Interesting – what is the JobManager submission bounded by? Does it only allow a certain number of submissions per second, or is there a number of threads it accepts?

 

// ah

 

From: Robert Metzger <[hidden email]>
Sent: Tuesday, August 11, 2020 4:46 AM
To: Hailu, Andreas [Engineering] <[hidden email]>
Cc: [hidden email]; Shah, Siddharth [Engineering] <[hidden email]>
Subject: Re: JobManager refusing connections when running many jobs in parallel?

 

Thanks for checking.

 

Your analysis sounds correct. The JM is busy processing job submissions, resulting in other submissions not being accepted.

 

Increasing rest.connection-timeout should resolve your problem.

 

 

On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <[hidden email]> wrote:

Thanks for pointing this out. We had a look - the nodes in our cluster have a cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t believe this is the problem.

The connection refused error makes us think it’s some process using a thread pool for the JobManager hitting capacity on a port somewhere. This sound correct? Is there a config for us to increase the pool size?


From: Robert Metzger <[hidden email]>
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: [hidden email]; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?

 

Hi Andreas,

 

Thanks for reaching out .. this should not happen ...

Maybe your operating system has configured low limits for the number of concurrent connections / sockets. Maybe this thread is helpful: https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections (there might better SO threads, I didn't put much effort into searching :) )

 

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <[hidden email]> wrote:

Hi team,

 

We’ve observed that when we submit a decent number of jobs in parallel from a single Job Master, we encounter job failures due with Connection Refused exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s seemingly transient, however, as upon several retries the job succeeds. The surface level error varies, but digging deeper in stack traces it looks to stem from the Job Manager no longer accepting connections.

 

I’ve included a couple of examples below from failed jobs’ driver logs, with different errors stemming from a connection refused error:

 

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 30 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url <a href="http://d43723-563.dc.gs.com:41268"> Job Manager Web Interface  (http://d43723-563.dc.gs.com:41268) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)

at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)

... 1 more

Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 16 more

Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)

at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)

... 6 more

Caused by: java.net.ConnectException: Connection refused

 

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task Manager memory - 60 jobs submitted in parallel, each with parallelism of 1

Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking url <a href="http://d43723-484.dc.gs.com:36757"> Job Manager Web Interface  (http://d43723-484.dc.gs.com:36757) </a>

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result. (JobID: 9c4a797df26b510a92a843c756dc4b3d)

at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)

at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)

at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)

...

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.

at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:382)

at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)

at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)

at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)

at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

... 3 more

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

... 4 more

... (this pattern repeats for number of unique JobIDs)

Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Could not upload job files.]

at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

...

26 05:46:39,734 [CASHFLOW-18394] WARN  FlinkClusterStateMonitor - Error while attempting to fetch job details for job 4d20537a676df2855e29b31b1de1ead5

com.gs.ep.data.lake.refinerlib.restful.RestfulException: failed connecting to http://d43723-484.dc.gs.com:36757/jobs/4d20537a676df2855e29b31b1de1ead5 after 1 time(s)

Caused by: java.net.ConnectException: Connection refused

at java.net.PlainSocketImpl.socketConnect(Native Method)

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

at java.net.Socket.connect(Socket.java:589)

at java.net.Socket.connect(Socket.java:538)

at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

 

These connection refusal exceptions and their transient nature makes me think that it might be a network-related issue. It’s not uncommon for us to need to run 100+ jobs in parallel. How can we investigate what’s causing the Job Manager to periodically refuse connections? I can see a Netty package in the first example’s stack trace – is there any option we can tune?

 

____________

 

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.

 

 



Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices

 



Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices




Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices