Strange DataSet behavior when using custom FileInputFormat

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Strange DataSet behavior when using custom FileInputFormat

Hynek Noll
Hi,
I'm trying to implement a custom FileInputFormat (to read the MNIST Dataset).
The creation of Flink DataSet (DataSet<byte[]> matrices) seems to be OK, but when I try to print it using either
matrices.print();
or
matrices.collect();

It finishes with exit code -17.
(Before, I compiled using Java 11 and aside from a reflection warning, this approach caused the program to run indefinitely. Now I use JDK 8)
The total number of elements is 60 000. Now the strange thing is that when I run

matrices.first(60000).print();

it does print the elements just fine. But my understanding is that these two approaches should work the same way, if there are exactly 60 000 records.

Is this a bug? Or something that can be explained by my extension of FileInputFormat (I might very well not use it correctly)?

Best regards,
Hynek
Reply | Threaded
Open this post in threaded view
|

Re: Strange DataSet behavior when using custom FileInputFormat

Zhu Zhu
Hi Hynek,

In execution, matrices.first(60000).print() is different from matrices.print(). 
It is adding a reducer operator to the job which only collects the first 6000 records from the source.
So if your InputFormat can generate more than 60000 (which can be unexpected though), and the trailing data are some how corrupted, this case may happen.

Do you have the detailed error info that why your program exits?
That can be helpful to identify the root cause.

Thanks,
Zhu Zhu


Hynek Noll <[hidden email]> 于2019年8月9日周五 下午8:59写道:
Hi,
I'm trying to implement a custom FileInputFormat (to read the MNIST Dataset).
The creation of Flink DataSet (DataSet<byte[]> matrices) seems to be OK, but when I try to print it using either
matrices.print();
or
matrices.collect();

It finishes with exit code -17.
(Before, I compiled using Java 11 and aside from a reflection warning, this approach caused the program to run indefinitely. Now I use JDK 8)
The total number of elements is 60 000. Now the strange thing is that when I run

matrices.first(60000).print();

it does print the elements just fine. But my understanding is that these two approaches should work the same way, if there are exactly 60 000 records.

Is this a bug? Or something that can be explained by my extension of FileInputFormat (I might very well not use it correctly)?

Best regards,
Hynek