Hi,
thanks for letting us know about this shortcoming.
I will link someone from the runtime team in the JIRA issue. Let's 
continue the discussion there.
Regards,
Timo
On 22.10.20 05:36, chenkaibit wrote:
> Hi everyone:
>   I ran into this exception when a hard disk was damaged:
> 
> https://issues.apache.org/jira/secure/attachment/13009035/13009035_flink_disk_error.png
> 
> 
> I checked the code and found that Flink creates a temp file when the
> record length exceeds 5 MB:
> 
> // SpillingAdaptiveSpanningRecordDeserializer.java
> if (nextRecordLength > THRESHOLD_FOR_SPILLING) {
>     // create a spilling channel and put the data there
>     this.spillingChannel = createSpillingChannel();
> 
>     ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk);
>     FileUtils.writeCompletely(this.spillingChannel, toWrite);
> }
> 
> 
> The tempDir is picked at random from all `tempDirs`. In YARN mode, one
> `tempDir` usually represents one hard disk.
> In my opinion, if a hard disk is damaged, the TaskManager should pick
> another disk (tmpDir) for the spilling channel rather than throw an
> IOException, which causes the Flink job to restart over and over again.
> 
> I have created a JIRA issue
> (https://issues.apache.org/jira/browse/FLINK-18811) to track this, and
> I'm looking forward to someone helping to review the code or discuss
> this issue.
> thanks!
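
For the discussion, the fallback proposed above could be sketched roughly
as follows. This is only a minimal illustration in plain Java, not Flink's
actual code: the class `SpillFileFallback`, the method `createSpillFile`,
and the file prefix/suffix are hypothetical names, and it assumes a damaged
disk shows up as an IOException when the temp file is created.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of Flink.
final class SpillFileFallback {

    /**
     * Tries the configured temp directories in random order and returns the
     * first spill file that could be created. Only fails if every directory
     * fails.
     */
    static Path createSpillFile(List<Path> tempDirs) throws IOException {
        // Shuffle a copy so the load is still spread over all disks.
        List<Path> candidates = new ArrayList<>(tempDirs);
        Collections.shuffle(candidates);

        IOException lastFailure = null;
        for (Path dir : candidates) {
            try {
                // Throws an IOException if the directory is missing or unwritable.
                return Files.createTempFile(dir, "spill-", ".inputchannel");
            } catch (IOException e) {
                // Remember the failure and try the next directory.
                lastFailure = e;
            }
        }
        throw new IOException(
                "Could not create a spill file in any of " + tempDirs, lastFailure);
    }

    public static void main(String[] args) throws IOException {
        Path healthy = Files.createTempDirectory("healthy-disk");
        Path broken = healthy.resolve("missing-disk"); // does not exist
        Path spill = createSpillFile(List.of(broken, healthy));
        System.out.println("Spill file created in: " + spill.getParent());
    }
}
```

A real fix would also have to handle a disk that fails later, during the
write itself, which this sketch does not cover.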