Hello,
We are currently using Flink 1.5 with the BucketingSink to save the results of job processing to HDFS. The data is in JSON format, and we store one object per line in the resulting files.

We are planning to upgrade to Flink 1.6, and we see that there is a new StreamingFileSink. From the description, it looks very similar to BucketingSink when using the row-encoded output format. Should we consider moving to StreamingFileSink? I would like to better understand the suggested use cases for each of the two options.

We are also considering additionally writing the data in Parquet format for data scientists (also stored in HDFS). I see some utilities for this that work with StreamingFileSink, so I guess that option is recommended for this case. Is it possible to use the Parquet writers even when the schema of the data may evolve?

Thanks in advance for your help.
(Sorry if I put too many questions in the same message.)

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Hi,
StreamingFileSink is meant to subsume BucketingSink, which will be deprecated. StreamingFileSink fixes some issues of BucketingSink, especially with AWS S3, and adds more flexibility in defining the rolling policy. StreamingFileSink does not support older Hadoop versions at the moment, but there are ideas for how to resolve this.

You can see how to use StreamingFileSink with Parquet here [1]. I also cc'ed Kostas; he might add more on this topic.

Best,
Andrey

[1] https://github.com/apache/flink/blob/0b4947b6142f813d2f1e0e662d0fefdecca0e382/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java

> On 16 Nov 2018, at 11:31, Edward Rojas <[hidden email]> wrote:
Thank you very much for the information, Andrey.

I'll try to migrate what we have now on my side and add the Parquet sink, and I'll get back to you if I have more questions :)

Edward

On Fri, 16 Nov 2018 at 19:54, Andrey Zagrebin (<[hidden email]>) wrote:
> Hi,

Edward Alexander Rojas Clavijo
Software Engineer
Hybrid Cloud
IBM France
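
For anyone reading this thread later: the "rolling policy" mentioned in the reply decides when the sink closes the part file it is currently writing and starts a new one. The sketch below is plain Python, not Flink code. The class names, fields, and threshold values are all hypothetical, and it assumes three common triggers (a size limit, a maximum age, and an inactivity timeout). Check the Flink documentation for the options StreamingFileSink actually exposes.

```python
from dataclasses import dataclass


@dataclass
class PartFileState:
    """Bookkeeping for the part file currently being written (hypothetical)."""
    size_bytes: int       # bytes written to the part file so far
    created_at: float     # time the part file was opened, in seconds
    last_write_at: float  # time of the most recent write, in seconds


@dataclass
class RollingPolicy:
    """Illustrative thresholds; the values are assumptions, not Flink defaults."""
    max_part_size_bytes: int = 128 * 1024 * 1024
    rollover_interval_s: float = 15 * 60
    inactivity_interval_s: float = 5 * 60

    def should_roll(self, part: PartFileState, now: float) -> bool:
        """Return True if the current part file should be closed."""
        too_big = part.size_bytes >= self.max_part_size_bytes
        too_old = now - part.created_at >= self.rollover_interval_s
        idle = now - part.last_write_at >= self.inactivity_interval_s
        return too_big or too_old or idle


# Example: a small, recently written part file is kept open.
policy = RollingPolicy(max_part_size_bytes=1024,
                       rollover_interval_s=60,
                       inactivity_interval_s=30)
part = PartFileState(size_bytes=100, created_at=0.0, last_write_at=10.0)
keep_open = not policy.should_roll(part, now=20.0)
```

Any one trigger is enough to roll the file, so tightening a single threshold (for example the size limit) produces more, smaller part files without touching the other two.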