Flink job restarts over and over again if a TaskManager's disk is damaged
Posted by chenkaibit
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/flink-job-will-restart-over-and-over-again-if-a-taskmanager-s-disk-damages-tp38862.html
Hi everyone:
I ran into this exception when a hard disk was damaged:
I checked the code and found that Flink creates a temp file when the record length exceeds 5 MB:
    if (nextRecordLength > THRESHOLD_FOR_SPILLING) {
        // the record is too large to keep in memory: open a spill file and write the chunk there
        this.spillingChannel = createSpillingChannel();
        ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk);
        FileUtils.writeCompletely(this.spillingChannel, toWrite);
    }
The temp directory is picked at random from all `tempDirs`. In YARN mode, one `tempDir` usually corresponds to one hard disk.
In my opinion, if a hard disk is damaged, the TaskManager should pick another disk (tmpDir) for the spilling channel rather than throw an IOException, which causes the Flink job to restart over and over again.
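To make the suggestion concrete, here is a rough sketch of the fallback I have in mind. It is not Flink's actual code: the class `SpillFallbackSketch`, the holder `SpillFile`, the method signature and the `.inputchannel` file suffix are all names I made up for illustration. The idea is only to try the temp directories in random order and fail when none of them can be used.

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.UUID;

// Hypothetical sketch, not Flink's implementation.
public class SpillFallbackSketch {

    /** Holds the opened channel together with the file backing it. */
    static final class SpillFile {
        final File file;
        final FileChannel channel;

        SpillFile(File file, FileChannel channel) {
            this.file = file;
            this.channel = channel;
        }
    }

    /**
     * Tries every temp directory in random order and returns the first
     * spill file that can be created. Throws only if all directories fail.
     */
    static SpillFile createSpillingChannel(List<File> tempDirs, Random rnd) throws IOException {
        if (tempDirs.isEmpty()) {
            throw new IOException("No temp directories configured");
        }

        // Shuffle a copy so the load is still spread across disks.
        List<File> candidates = new ArrayList<>(tempDirs);
        Collections.shuffle(candidates, rnd);

        IOException lastFailure = null;
        for (File dir : candidates) {
            File file = new File(dir, UUID.randomUUID() + ".inputchannel");
            try {
                FileChannel channel = FileChannel.open(
                        file.toPath(),
                        StandardOpenOption.CREATE_NEW,
                        StandardOpenOption.WRITE);
                return new SpillFile(file, channel);
            } catch (IOException e) {
                // This directory (disk) looks unusable: remember why, try the next one.
                lastFailure = e;
            }
        }
        throw new IOException(
                "Could not create a spill file in any of " + candidates.size() + " temp directories",
                lastFailure);
    }
}
```

This only covers a failure while creating the file. A disk can also fail later during the write, so a real fix would probably need to handle that case as well.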
Thanks!