Flink 1.1.3 RollingSink - understanding output blocks/parallelism

Flink 1.1.3 RollingSink - understanding output blocks/parallelism

Dominik Safaric
Hi everyone,

Although this question might sound trivial, I’ve been curious about the following. Given a Flink topology with the parallelism set to 6, for example, that writes its data stream to HDFS using a RollingSink instance, how is the output structured? By structure I mean that this setup results in 6 distinct files, whereas I would like a single file containing all of the output values from the DataStream.

Regards,
Dominik
Re: Flink 1.1.3 RollingSink - understanding output blocks/parallelism

Aljoscha Krettek
Hi Dominik,
I think having a single output file is only possible if you set the parallelism of the sink to 1. AFAIK it is not possible to concurrently write to a single HDFS file from multiple clients.
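
A minimal sketch of what that could look like, assuming the Flink 1.1 filesystem connector (flink-connector-filesystem) is on the classpath. The output path, the job name, the class name and the toy source are made up for illustration. Only the sink is pinned to parallelism 1; the rest of the job keeps parallelism 6:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.NonRollingBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;

public class SingleWriterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Default parallelism for the operators of this job.
        env.setParallelism(6);

        // Toy input standing in for the real stream (assumption).
        DataStream<String> values = env
            .fromElements("a", "b", "c")
            .map(new MapFunction<String, String>() {
                @Override
                public String map(String value) {
                    return value.toUpperCase();
                }
            });

        // Hypothetical HDFS base path; replace with your own.
        RollingSink<String> sink = new RollingSink<>("hdfs:///tmp/flink-output");
        // Write directly into the base path instead of time-based bucket directories.
        sink.setBucketer(new NonRollingBucketer());

        // Override the parallelism for the sink only, so a single subtask
        // (a single HDFS client) does all the writing.
        values.addSink(sink).setParallelism(1);

        env.execute("single-writer-rolling-sink");
    }
}
```

Two caveats: all records are funneled through one sink subtask, which may become a bottleneck, and as far as I know the sink still starts a new part file once the configured batch size is reached (see setBatchSize), so "a single file" means one file being written at a time rather than exactly one file for the lifetime of the job.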

Cheers,
Aljoscha

(In reply to Dominik Safaric's message of Wed, 14 Dec 2016, 20:57.)