Hello,
I'm not sure whether this is a configuration problem on my side or an inconsistency in the documentation. According to FLIP-19 (https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture):

"If a job fails, all non-HA files' refCounts are reset to 0; all HA files' refCounts remain and will not be increased again on recovery."

However, in the JobManager's code, when the job status changes to failed and the JobManager receives the corresponding message, it sends a RemoveJob message to itself. That invokes removeJob(), which always invokes the following: libraryCacheManager.unregisterJob(jobID)

As far as I understand, this removes the blob entries immediately, whereas according to the doc it should only freeze the refCounts for HA files and reset the refCounts for non-HA files to allow their later removal.

Is the doc right, and have I missed something here?

Thanks in advance.
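To make the mismatch concrete, here is a minimal sketch in Java of the two behaviours as I understand them. Everything in it (BlobCleanupSketch, BlobEntry, the method names, the blob keys and the starting refCounts) is hypothetical and only models the refCount semantics; it is not Flink's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the two behaviours discussed above; not Flink code.
final class BlobCleanupSketch {

    // Hypothetical stand-in for a cached blob belonging to one job.
    static final class BlobEntry {
        final boolean highAvailability;
        int refCount;

        BlobEntry(boolean highAvailability, int refCount) {
            this.highAvailability = highAvailability;
            this.refCount = refCount;
        }
    }

    // Behaviour as described in FLIP-19: on failure, non-HA refCounts are
    // reset to 0 and HA refCounts are left untouched. No entry is removed here.
    static void onFailureAsDocumented(Map<String, BlobEntry> blobs) {
        for (BlobEntry entry : blobs.values()) {
            if (!entry.highAvailability) {
                entry.refCount = 0;
            }
        }
    }

    // Behaviour as I read it in the JobManager: unregistering the job drops
    // all of its blob entries at once, HA and non-HA alike.
    static void onFailureAsObserved(Map<String, BlobEntry> blobs) {
        blobs.clear();
    }

    // Made-up sample data: one HA blob and one non-HA blob.
    static Map<String, BlobEntry> sampleJobBlobs() {
        Map<String, BlobEntry> blobs = new LinkedHashMap<>();
        blobs.put("ha-blob", new BlobEntry(true, 1));
        blobs.put("non-ha-blob", new BlobEntry(false, 2));
        return blobs;
    }

    public static void main(String[] args) {
        Map<String, BlobEntry> documented = sampleJobBlobs();
        onFailureAsDocumented(documented);
        System.out.println("documented: entries left = " + documented.size());

        Map<String, BlobEntry> observed = sampleJobBlobs();
        onFailureAsObserved(observed);
        System.out.println("observed: entries left = " + observed.size());
    }
}
```

In the first variant the entries survive the failure and are only marked for later cleanup; in the second nothing is left to recover from.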
Hmm, this indeed looks odd. Looping in Till (cc), who might know more about this.
On 20.06.2018 16:43, Dominik Wosiński wrote:
Hi Dominik,

all job-related files (non-HA as well as HA) are removed once the job reaches a globally terminal state (FINISHED, CANCELLED, FAILED). This is the case because Flink assumes that the job is done and won't be retried afterwards. Thus, the documentation in the FLIP is not correct and should be updated.

Cheers,
Till

On Wed, Jun 20, 2018 at 7:11 PM Chesnay Schepler <[hidden email]> wrote: