Restart Flink in Yarn

Dominique Rondé-2
Hi @all,

I have a YARN cluster with 5 nodes and a running Flink (0.10.2) instance. Today we shut down one of the YARN hosts for maintenance. After the restart, some Flink streaming routes are stuck in a restarting state (see stack trace below). Now I want to restart these routes so that they continue their work from the last checkpoint. What can I do?

Greets
Dominique

Stacktrace
===================================================================================
java.io.IOException: Cannot get library with hash 8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:254)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:114)
	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:710)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:471)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to fetch BLOB 8f15fe4a8137ca2f9fb348ec634f3703f4fd7317 from /10.24.20.14:60485 and store it under /tmp/blobStore-efdeddf9-d096-440f-a4cb-9c79334ff92c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:177)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:245)
	... 4 more
Caused by: java.io.IOException: GET operation failed: Server side error: Cannot find required BLOB at /tmp/blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobClient.get(BlobClient.java:165)
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:125)
	... 5 more
Caused by: java.io.IOException: Server side error: Cannot find required BLOB at /tmp/blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobClient.receiveAndCheckResponse(BlobClient.java:213)
	at org.apache.flink.runtime.blob.BlobClient.get(BlobClient.java:159)
	... 6 more
Caused by: java.io.IOException: Cannot find required BLOB at /tmp/blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:202)
	at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:112)

Re: Restart Flink in Yarn

rmetzger0
Hi Dominique,
I'm sorry that you ran into this issue.
What do you mean by "flink streaming routes"?

Regarding the second question: "Now I want to restart these routes to continue their work from the last checkpoint. What can I do?"
I think the feature you are looking for is savepoints: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
However, savepoints were added in Flink 1.0, so they are not available in your 0.10 release.


I have to admit that I haven't seen the "Cannot find required BLOB at ..." exception before. Is there any chance that the files were deleted from the /tmp directory by an external service (such as a periodic cleanup script), or that /tmp has been mounted on another disk in the meantime?
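
To check the /tmp question by hand, something like the following could be run on the host that serves the BLOB (10.24.20.14 in the trace). It is only a rough sketch: the `find_blob` helper is a made-up name, and the `blobStore-*/cache/blob_<hash>` layout is assumed purely from the paths in the stack trace, not from the Flink documentation.

```python
from pathlib import Path


def find_blob(tmp_root, blob_hash):
    """Return every blobStore-*/cache/blob_<hash> file found under tmp_root.

    The directory layout is an assumption taken from the paths in the
    stack trace; find_blob is a hypothetical helper, not part of Flink.
    """
    root = Path(tmp_root)
    pattern = f"blobStore-*/cache/blob_{blob_hash}"
    return sorted(p for p in root.glob(pattern) if p.is_file())


if __name__ == "__main__":
    # Hash and /tmp location are copied from the stack trace in the first post.
    blob_hash = "8f15fe4a8137ca2f9fb348ec634f3703f4fd7317"
    matches = find_blob("/tmp", blob_hash)
    if matches:
        for path in matches:
            print(f"found: {path} ({path.stat().st_size} bytes)")
    else:
        print(f"no blob_{blob_hash} under /tmp/blobStore-*/cache")
```

If nothing is found on that host, that would be consistent with the files having been cleaned up or /tmp having changed underneath the running process.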



On Wed, May 4, 2016 at 6:27 PM, Dominique Rondé wrote:
> [...]


Re: Restart Flink in Yarn

Ufuk Celebi
Hey Dominique!

Are you running the job in HA mode?

– Ufuk

On Thu, May 5, 2016 at 1:49 PM, Robert Metzger wrote:
> [...]