Flink failed when can not connect to BlobServer

Si-li Liu
Hi all,

I am using the Flink DataSet API for a batch job that reads some logs, then groups and sorts them. Our cluster has almost 2000 servers and we are used to running traditional MR jobs. I tried Flink for an experimental job, but I ran into the error below and cannot continue. Can anyone help with it?

Our MR jobs also hit such connection errors sometimes, but they retry several times and then succeed. In Flink, it seems that the whole job fails when a single task fails.

java.io.IOException: Cannot get library with hash 858478de9791c1a5fbbb138c02ec182b916f7962
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:262)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:116)
	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:472)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to fetch BLOB 858478de9791c1a5fbbb138c02ec182b916f7962 from /10.132.99.150:42927 and store it under /tmp/blobStore-a2b79e70-74b9-49e8-a5bb-f2842aeec3b0/cache/blob_858478de9791c1a5fbbb138c02ec182b916f7962
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:177)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:253)
	... 4 more
Caused by: java.io.IOException: Could not connect to BlobServer at address /10.132.99.150:42927
	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:88)
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:124)
	... 5 more
Caused by: java.net.ConnectException: Connection timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:84)
	... 6 more

--
Best regards

Sili Liu

Re: Flink failed when can not connect to BlobServer

Stephan Ewen
Hi!

The Blob server runs on the JobManager and is used to distribute JAR files.

The best way to handle this scale is one of the following:

 Option (1)  Use the 1.2-SNAPSHOT version to run Flink on YARN. It adds the JAR files to the job's YARN resources, so no BLOBs need to be fetched.
 Option (2)  Manually add your JAR files to Flink's lib folder.


If you cannot do that, you can try to configure the BLOB server to handle more connections. In particular, increase the backlog and the number of concurrent connections.
These are the relevant config options:
  • blob.fetch.num-concurrent: The number of concurrent BLOB fetches (such as JAR file downloads) that the JobManager serves (DEFAULT: 50).
  • blob.fetch.backlog: The maximum number of queued BLOB fetches (such as JAR file downloads) that the JobManager allows (DEFAULT: 1000).
  • blob.fetch.retries: The number of retries for the TaskManager to download BLOBs (such as JAR files) from the JobManager (DEFAULT: 50).
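
As a rough sketch, these options could be set in flink-conf.yaml like this. The values are only illustrative assumptions for a cluster of this size, not tested recommendations, so tune them for your setup:

```yaml
# flink-conf.yaml (illustrative values only, assumed for a large cluster)

# Serve more BLOB fetches at the same time on the JobManager (default: 50)
blob.fetch.num-concurrent: 200

# Allow more BLOB fetches to wait in the queue on the JobManager (default: 1000)
blob.fetch.backlog: 4000

# Let each TaskManager retry a failed BLOB download more often (default: 50)
blob.fetch.retries: 100
```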

In the longer run, we are thinking of creating a version of the BLOB server that distributes files via a DFS (for example HDFS).

Greetings,
Stephan



On Mon, Nov 7, 2016 at 10:02 AM, Si-li Liu wrote:
> [...]