MultipleFileOutput based on field


MultipleFileOutput based on field

Yiannis Gkoufas
Hi there,

is it possible to write the results to HDFS into different files based on a field of the tuple?

Thanks a lot!

Re: MultipleFileOutput based on field

rmetzger0
Hi,

right now, there is no shiny API in Flink to do this directly, but you can use Hadoop's MultipleTextOutputFormat with Flink's HadoopOutputFormat wrapper.

The example looks quite messy but worked well locally. 
It should also work on clusters (I haven't tested it).
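Roughly, a sketch could look like the following (this is not the exact code I used; the class name, paths and the package of HadoopOutputFormat are assumptions and depend on your Flink/Hadoop versions). You subclass Hadoop's MultipleTextOutputFormat so that the key of each record decides which file it is written to, and hand that format to Flink through the HadoopOutputFormat wrapper:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapred.HadoopOutputFormat; // package varies by Flink version
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// The key of each (key, value) record decides which file it ends up in.
public class FieldBasedTextOutput extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // one sub-directory per distinct key value
        return key.toString() + "/" + name;
    }
}

// In the job: records as (field-to-split-on, payload) pairs.
DataSet<Tuple2<Text, Text>> keyed = ...;

JobConf jobConf = new JobConf();
FileOutputFormat.setOutputPath(jobConf, new Path("hdfs:///output/base")); // made-up path
keyed.output(new HadoopOutputFormat<Text, Text>(new FieldBasedTextOutput(), jobConf));

Note that MultipleTextOutputFormat comes from Hadoop's old "mapred" API, so you need the mapred variant of the wrapper, not the mapreduce one.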


There is always another way to solve these kinds of issues, using a "tagged" DataSet:

DataSet<String> start = ...;
// doSomething() stands for whatever transformation produces the tag
DataSet<Tuple2<Integer, String>> tagged = start.doSomething( str -> new Tuple2<>(<putOutputNumberHere>, str) );
DataSet<Tuple2<Integer, String>> out1 = tagged.filter( t -> t.f0 == 0 );
DataSet<Tuple2<Integer, String>> out2 = tagged.filter( t -> t.f0 == 1 );
DataSet<Tuple2<Integer, String>> out3 = tagged.filter( t -> t.f0 == 2 );

and then you can write out the DataSets out1 - out3 to separate files.
With this approach, you can "simulate" directing outputs from "doSomething()" into different transformation chains / file outputs.
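The sink side then just becomes one write per filtered DataSet, for example (the paths here are made up; the tag field could be stripped with a map() first if you don't want it in the files):

// one sink per filtered DataSet
out1.writeAsText("hdfs:///output/tag-0");
out2.writeAsText("hdfs:///output/tag-1");
out3.writeAsText("hdfs:///output/tag-2");

env.execute(); // env being the ExecutionEnvironment of the job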

Best,
Robert

