BucketingSink vs StreamingFileSink

Edward Rojas
Hello,
We are currently on Flink 1.5 and use the BucketingSink to write the
results of our job to HDFS.
The data is in JSON format, and we store one object per line in the resulting
files.

We are planning to upgrade to Flink 1.6, and we see that it has a new
StreamingFileSink. From its description, it looks very similar to
BucketingSink when used with the row-encoded output format. Should
we consider moving to StreamingFileSink?
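
For context, "row-encoded" here means that each record is serialized on its own and appended to the current part file, which matches the one-JSON-object-per-line layout described above. A minimal plain-Java sketch of that idea follows. The class and method names (LineEncoderSketch, encode, encodeAll) are hypothetical and only imitate the shape of a per-record encoder; they are not Flink's API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of row-wise encoding; not part of Flink.
public class LineEncoderSketch {

    // Writes one record followed by a newline, so each JSON object sits on its own line.
    public static void encode(String jsonRecord, OutputStream out) throws IOException {
        out.write(jsonRecord.getBytes(StandardCharsets.UTF_8));
        out.write('\n');
    }

    // Encodes a list of records into an in-memory stand-in for a part file.
    public static String encodeAll(List<String> records) {
        ByteArrayOutputStream partFile = new ByteArrayOutputStream();
        try {
            for (String record : records) {
                encode(record, partFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(partFile.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("{\"id\":1}", "{\"id\":2}");
        System.out.print(encodeAll(records));
    }
}
```

Because every record is written independently, a row-encoded part file can be closed after any record, which is not the case for columnar formats such as Parquet.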

I would like to better understand the suggested use cases for each
of the two options.

We are also considering additionally writing the data in Parquet format
for our data scientists (also stored in HDFS). I see there are some
utilities for using Parquet with StreamingFileSink, so I guess
StreamingFileSink is the recommended option for this case?
Is it possible to use the Parquet writers even when the schema of the data
may evolve?

Thanks in advance for your help.
(Sorry for putting so many questions in one message.)



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: BucketingSink vs StreamingFileSink

Andrey Zagrebin
Hi,

StreamingFileSink is meant to subsume BucketingSink, which will be deprecated.

StreamingFileSink fixes some issues of BucketingSink, especially with AWS S3,
and adds more flexibility in defining the rolling policy.
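
To illustrate what "defining the rolling policy" can mean, here is a minimal plain-Java sketch of a policy that closes the current part file when it has grown too large, has been open too long, or has received no data for a while. The class RollingPolicySketch, its parameters, and the thresholds in main are assumptions made for illustration. This is not Flink's implementation; the actual configuration options are described in the StreamingFileSink documentation.

```java
// Hypothetical illustration of a rolling decision; not part of Flink.
public class RollingPolicySketch {

    private final long maxPartSizeBytes;
    private final long rolloverIntervalMs;
    private final long inactivityIntervalMs;

    public RollingPolicySketch(long maxPartSizeBytes, long rolloverIntervalMs, long inactivityIntervalMs) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.rolloverIntervalMs = rolloverIntervalMs;
        this.inactivityIntervalMs = inactivityIntervalMs;
    }

    // Returns true when the current part file should be closed and a new one started.
    public boolean shouldRoll(long partSizeBytes, long createdAtMs, long lastWriteAtMs, long nowMs) {
        boolean tooLarge = partSizeBytes >= maxPartSizeBytes;
        boolean tooOld = nowMs - createdAtMs >= rolloverIntervalMs;
        boolean inactive = nowMs - lastWriteAtMs >= inactivityIntervalMs;
        return tooLarge || tooOld || inactive;
    }

    public static void main(String[] args) {
        // Assumed thresholds: 128 MiB per part, roll every 15 minutes, or after 5 idle minutes.
        RollingPolicySketch policy =
                new RollingPolicySketch(128L * 1024 * 1024, 15 * 60 * 1000L, 5 * 60 * 1000L);
        long now = System.currentTimeMillis();
        // A 1 KiB part created and last written one second ago is below every threshold.
        System.out.println(policy.shouldRoll(1024, now - 1000, now - 1000, now));
    }
}
```

Keeping the decision in one small function makes it straightforward to combine size-based and time-based conditions, or to replace them with a custom rule.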

StreamingFileSink does not support older Hadoop versions at the moment,
but there are ideas on how to resolve this.

You can see how to use StreamingFileSink with Parquet in the test case at [1].

I have also cc'ed Kostas, who might add more on this topic.

Best,
Andrey

[1] https://github.com/apache/flink/blob/0b4947b6142f813d2f1e0e662d0fefdecca0e382/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java

> On 16 Nov 2018, at 11:31, Edward Rojas <[hidden email]> wrote:


Re: BucketingSink vs StreamingFileSink

Edward Rojas
Thank you very much for the information, Andrey.

I'll try migrating what we have now and adding the Parquet sink on my side, and I'll get back to you if I have more questions :)

Edward

On Fri, 16 Nov 2018 at 19:54, Andrey Zagrebin (<[hidden email]>) wrote:



--
Edward Alexander Rojas Clavijo

Software Engineer
Hybrid Cloud
IBM France