Hi all,
I am using the Flink DataSet API for a batch job that reads some logs, then groups and sorts them. Our cluster has almost 2000 servers. We are used to running traditional MR jobs; when I tried Flink for an experimental job, I ran into the error below and could not continue. Can anyone help with it? Our MR jobs sometimes hit similar connection errors, but they retry several times and then succeed. In Flink, it seems the whole job fails when a single task fails.

java.io.IOException: Cannot get library with hash 858478de9791c1a5fbbb138c02ec18
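For context, the job is roughly of the following shape. This is only a simplified sketch for illustration; the input/output paths, the tab delimiter, and the key/timestamp field positions are placeholders, not our real code:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class LogGroupAndSort {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Parse each log line into (key, timestamp); path and field layout are placeholders.
        DataSet<Tuple2<String, Long>> events = env
            .readTextFile("hdfs:///logs/input")
            .map(new MapFunction<String, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> map(String line) {
                    String[] fields = line.split("\t");
                    return Tuple2.of(fields[0], Long.parseLong(fields[1]));
                }
            });

        // Group by the key field and sort each group by timestamp.
        events.groupBy(0)
              .sortGroup(1, Order.ASCENDING)
              .first(100)
              .writeAsCsv("hdfs:///logs/output");

        env.execute("log group-and-sort");
    }
}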
Best regards,
Sili Liu
Hi!

The Blob server runs on the JobManager and is used to distribute JAR files. The best way to handle this at scale is one of the following:

Option (1): Use the 1.2-SNAPSHOT version to run Flink on YARN. It will add the JAR files to the job's YARN resources, so no BLOBs need to be fetched.

Option (2): Manually add your JAR files to the lib folder.

If you cannot do that, you can try to configure the BLOB server to handle more connections, in particular by increasing the backlog and the number of concurrent connections. See the configuration documentation for the relevant options.
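For example, something along these lines in flink-conf.yaml should raise the limits (the values below are only illustrative starting points, not recommendations):

    blob.fetch.num-concurrent: 200
    blob.fetch.backlog: 5000
    blob.fetch.retries: 10

Here blob.fetch.num-concurrent controls how many concurrent BLOB fetches the JobManager serves, blob.fetch.backlog the socket backlog on the BLOB server, and blob.fetch.retries how often a TaskManager retries a failed fetch.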
In the longer run, we are thinking about creating a version of the BLOB server that distributes files via a DFS (for example, HDFS).

Greetings,
Stephan

On Mon, Nov 7, 2016 at 10:02 AM, Si-li Liu <[hidden email]> wrote: