Job fails with FileNotFoundException from blobStore


Job fails with FileNotFoundException from blobStore

Robert Waury
Hi,

I'm suddenly getting FileNotFoundExceptions because the blobStore cannot find files in /tmp.

The job used to work in the exact same setup (same versions, same cluster, same input files).

Flink version: 0.8 release
HDFS: 2.3.0-cdh5.1.2

Flink trace:
http://pastebin.com/SKdwp6Yt

Any idea what could be the reason behind this?

Cheers,
Robert

Re: Job fails with FileNotFoundException from blobStore

Stephan Ewen

Hey Robert!

Which version are you on? 0.8 or 0.9-SNAPSHOT?


Re: Job fails with FileNotFoundException from blobStore

Robert Waury
I compiled from the release-0.8 branch.



Re: Job fails with FileNotFoundException from blobStore

Ufuk Celebi
Hey Robert,

is this error reproducible?

I've looked into the blob store and the error occurs when the blob cache tries to *create* a local file before requesting it from the job manager.

I will add a check to the blob store to ensure that the parent directories have been created. Other than that, I currently have no clue what the problem might be.

Can you verify whether the parent directory has been created on your machines? You could do this before you submit the tasks. I will add the respective log messages to the blob cache and push to 0.8/master.
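The check I'm adding is roughly the following (simplified sketch, the actual code will differ):

    import java.io.File;
    import java.io.IOException;

    public class BlobDirectoryCheck {

        // Ensure the parent directory of a blob's local file exists before
        // the cache tries to create the file itself, and fail with a clear
        // message if it cannot be created (e.g. because the disk is full).
        public static File ensureBlobFile(File storageDir, String blobKey) throws IOException {
            File blobFile = new File(storageDir, blobKey);
            File parent = blobFile.getParentFile();
            // mkdirs() returns false if nothing was created, which is also
            // the case when the directory already exists, so check both.
            if (!parent.mkdirs() && !parent.isDirectory()) {
                throw new IOException(
                    "Could not create blob storage directory " + parent.getAbsolutePath());
            }
            return blobFile;
        }
    }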

– Ufuk



Re: Job fails with FileNotFoundException from blobStore

Robert Waury
Hi,

I can reproduce the error on my cluster.

Unfortunately I can't check whether the parent directories were created on the different nodes since I have no way of accessing them. I start all the jobs from a gateway.

Cheers,
Robert






Re: Job fails with FileNotFoundException from blobStore

Ufuk Celebi

I've added a check to the directory creation (in branches release-0.8 and master), which should fail with a proper error message if that is the problem. If you have time to (re)deploy Flink, it would be great to know if that indeed is the issue. Otherwise, we need to further investigate this.



Re: Job fails with FileNotFoundException from blobStore

Robert Waury
I talked with the admins. The problem seems to have been that the disk was full and Flink couldn't create the directory.

Maybe the error message should say so if that is the cause.

While cleaning up the disk we noticed that a lot of temporary blobStore files were not deleted by Flink after the job finished. This seems to have caused, or at least worsened, the problem.

Cheers,
Robert




Re: Job fails with FileNotFoundException from blobStore

Till Rohrmann
Hi Robert,

thanks for the info. If the TaskManager/JobManager does not shut down properly (e.g. because the process is killed), then it is indeed the case that the BlobManager cannot properly remove all stored files. I don't know if this was lately the case for you. Furthermore, the files are not deleted directly after the job has finished: internally there is a cleanup task which is triggered every hour and deletes all blobs that are no longer referenced.
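Conceptually, the cleanup works along these lines (illustrative sketch only, not the actual Flink code; the names are made up):

    import java.io.File;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    // Reference-counted blob cache whose cleanup task runs once per hour
    // and deletes all blobs that are no longer referenced by any job.
    public class BlobCleanupSketch {

        private final ConcurrentHashMap<File, AtomicInteger> refCounts =
            new ConcurrentHashMap<>();
        private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

        public BlobCleanupSketch() {
            timer.scheduleWithFixedDelay(this::cleanup, 1, 1, TimeUnit.HOURS);
        }

        public void register(File blob) {   // a job starts using a blob
            refCounts.computeIfAbsent(blob, k -> new AtomicInteger()).incrementAndGet();
        }

        public void release(File blob) {    // the job is done with it
            AtomicInteger count = refCounts.get(blob);
            if (count != null) {
                count.decrementAndGet();
            }
        }

        private void cleanup() {
            for (Map.Entry<File, AtomicInteger> e : refCounts.entrySet()) {
                // delete blobs whose reference count has dropped to zero
                if (e.getValue().get() <= 0 && (e.getKey().delete() || !e.getKey().exists())) {
                    refCounts.remove(e.getKey());
                }
            }
        }
    }

So if a process is killed between two cleanup runs, whatever the task left behind in /tmp simply stays there.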

But we definitely have to look into it to see how we could improve this behaviour.

Greets,

Till



Re: Job fails with FileNotFoundException from blobStore

Ufuk Celebi
Thank you very much, Robert!

The problem is that the job/task manager shutdown methods are never called: when stopping Flink via the scripts, the task/job manager processes simply get killed, so the cleanup in those methods never runs.

@Till: Do you know whether there is a mechanism in Akka to register the actors for JVM shutdown hooks? I tried to register a shutdown hook via Runtime.getRuntime().addShutdownHook(), but I didn't manage to get a reference to the task manager.
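For illustration, the kind of hook I tried to register looks roughly like this (sketch; assume blobStorageDir is the task manager's local blob directory, which is exactly the reference I couldn't get hold of):

    import java.io.File;

    public class BlobCleanupHook {

        // Register a JVM shutdown hook that deletes the blob storage
        // directory. The hook runs on a normal TERM signal, but not when
        // the process is killed with SIGKILL (kill -9).
        public static void register(final File blobStorageDir) {
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    deleteRecursively(blobStorageDir);
                }
            });
        }

        private static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
            f.delete();
        }
    }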




Re: Job fails with FileNotFoundException from blobStore

Till Rohrmann
Hmm, that is not a very gentle way to terminate the Job-/TaskManagers. I'll check how the ActorSystem behaves when the process is killed.

Why can't we implement a more graceful termination mechanism? For example, we could send a termination message to the JobManager and TaskManagers.


Re: Job fails with FileNotFoundException from blobStore

Stephan Ewen
I think that killing the process (with a TERM signal) is a very typical way in Linux to shut down processes. It is the most robust way, since it does not require sending any custom messages to the process.

This is sort of graceful, as the JVM gets the signal and may do a lot of things before shutting down, such as running shutdown hooks. The ungraceful variant is the KILL signal, which removes the process immediately without giving it a chance to clean up.
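A toy example to see the difference: run this, then send the process a TERM signal (kill <pid>) versus a KILL signal (kill -9 <pid>); the hook only runs in the first case.

    public class ShutdownHookDemo {
        public static void main(String[] args) throws InterruptedException {
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    // runs on TERM, never on KILL
                    System.out.println("shutdown hook ran");
                }
            });
            System.out.println("up, waiting to be killed...");
            Thread.sleep(Long.MAX_VALUE);
        }
    }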





Re: Job fails with FileNotFoundException from blobStore

Ufuk Celebi
After talking to Robert and Till offline, what about the following:

- We add a shutdown hook to the blob library cache manager to shut down the blob service (just a delete call).
- As Robert pointed out, we cannot do this with the IOManager paths right now, because they are essentially shared among multiple Flink instances. Therefore we add an IOManager directory per Flink instance as well, which we can simply delete on shutdown (see the sketch below).

Is that OK?
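A rough sketch of the per-instance directory idea (hypothetical names, just to illustrate):

    import java.io.File;
    import java.util.UUID;

    // Each Flink process creates a uniquely named temp subdirectory, so it
    // can safely delete the whole directory on shutdown without touching
    // files that belong to other instances sharing the same base temp dir.
    public class PerInstanceTempDir {

        public static File create(File baseTempDir) {
            final File dir = new File(baseTempDir, "flink-io-" + UUID.randomUUID());
            if (!dir.mkdirs()) {
                throw new RuntimeException("Could not create temp directory " + dir);
            }
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    deleteRecursively(dir);
                }
            });
            return dir;
        }

        private static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
            f.delete();
        }
    }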



Re: Job fails with FileNotFoundException from blobStore

Stephan Ewen
Sounds good. In the course of this, we should probably extend the IOManager so that it keeps track of its temp files and deletes them when a task is done.
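Roughly along these lines (hypothetical sketch, not an actual API proposal):

    import java.io.File;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // The IOManager keeps track of every temp file it hands out per task
    // and deletes them as soon as the task finishes.
    public class TrackedTempFiles {

        private final ConcurrentHashMap<String, Set<File>> filesByTask =
            new ConcurrentHashMap<>();

        public File createTempFile(String taskId, File tempDir, String name) {
            File f = new File(tempDir, taskId + "-" + name);
            Set<File> files = filesByTask.get(taskId);
            if (files == null) {
                files = ConcurrentHashMap.newKeySet();
                Set<File> existing = filesByTask.putIfAbsent(taskId, files);
                if (existing != null) {
                    files = existing;
                }
            }
            files.add(f);
            return f;
        }

        // Called when a task finishes, fails, or is cancelled.
        public void onTaskFinished(String taskId) {
            Set<File> files = filesByTask.remove(taskId);
            if (files != null) {
                for (File f : files) {
                    f.delete();
                }
            }
        }
    }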
