Hi,

I'm trying to stream events in Protobuf format into a Parquet file. I looked into both streaming-file options: BucketingSink and StreamingFileSink.

I first tried using the newer StreamingFileSink with the forBulkFormat API. I noticed there's currently support only for the Avro format, via ParquetAvroWriters. I followed the same convention as the Avro writers and wrote a ParquetProtoWriters builder class:

public class ParquetProtoWriters {

And then I use it as follows:

StreamingFileSink

I ran tests on the ParquetProtoWriters itself, and it writes everything properly and I'm able to read the files. But when I use the sink as part of a job, I see illegal Parquet files created:

# parquet-tools cat .part-0-0.inprogress.b3a65d7f-f297-46bd-bdeb-ad24106ad3ea

Can anyone suggest what I'm missing here?

When trying to use the BucketingSink, I wrote a Writer class for Protobuf and everything worked perfectly:

public class FlinkProtoParquetWriter<T extends MessageOrBuilder> implements Writer<T> {

Rafi
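For reference, a minimal sketch of what such a ParquetProtoWriters class could look like when following the ParquetAvroWriters convention. It assumes Flink's ParquetWriterFactory / ParquetBuilder API and ProtoWriteSupport from the parquet-protobuf artifact; the names below are illustrative, not the exact code from the message above:

import com.google.protobuf.Message;
import org.apache.flink.formats.parquet.ParquetBuilder;
import org.apache.flink.formats.parquet.ParquetWriterFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.proto.ProtoWriteSupport;

public class ParquetProtoWriters {

    // Creates a writer factory for the given generated Protobuf message class.
    public static <T extends Message> ParquetWriterFactory<T> forType(Class<T> protoClass) {
        ParquetBuilder<T> builder = out -> new ParquetProtoWriterBuilder<>(out, protoClass).build();
        return new ParquetWriterFactory<>(builder);
    }

    private static class ParquetProtoWriterBuilder<T extends Message>
            extends ParquetWriter.Builder<T, ParquetProtoWriterBuilder<T>> {

        private final Class<T> protoClass;

        ParquetProtoWriterBuilder(OutputFile outputFile, Class<T> protoClass) {
            super(outputFile);
            this.protoClass = protoClass;
        }

        @Override
        protected ParquetProtoWriterBuilder<T> self() {
            return this;
        }

        @Override
        protected WriteSupport<T> getWriteSupport(Configuration conf) {
            // ProtoWriteSupport (from parquet-protobuf) maps Protobuf messages to the Parquet schema.
            return new ProtoWriteSupport<>(protoClass);
        }
    }

    private ParquetProtoWriters() {}
}

The nested builder follows parquet's ParquetWriter.Builder pattern, so the only Protobuf-specific piece is plugging ProtoWriteSupport into getWriteSupport().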
Hi,
I'm not sure, but shouldn't you be reading only the committed files and ignoring the in-progress ones? Maybe Kostas can add more insight on this topic.

Piotr Nowojski
Hi Rafi,

Piotr is correct. In-progress files are not necessarily readable. The valid files are the ones that are "committed" or finalized.

Cheers,
Kostas

On Thu, Mar 21, 2019 at 2:53 PM Piotr Nowojski <[hidden email]> wrote:
Hi Piotr and Kostas,

Thanks for your reply. The issue is that I don't see any committed files, only in-progress ones.

I tried to debug the code for more details. I see that in BulkPartWriter I do reach the write methods and see events getting written, but I never reach closeForCommit. I go straight to the close function, where all parts are disposed.

In my job I have a finite stream (the source is reading from Parquet file/s), doing some windowed aggregation and writing back to a Parquet file. As far as I know, it should commit files during checkpoints and when the stream has finished. I did enable checkpointing. I did verify that if I connect other sinks, I see the events.

Let me know if I can provide any further information that could be helpful. Would appreciate your help.

Thanks,
Rafi

On Thu, Mar 21, 2019 at 5:20 PM Kostas Kloudas <[hidden email]> wrote:
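For context, a rough sketch of the kind of job described above, with checkpointing enabled on the environment (imports omitted; MyEvent, MyAggregate, ParquetProtoSource and MyAggregateFunction are hypothetical placeholders, not the actual job code):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // StreamingFileSink finalizes bulk part files on checkpoints

// Finite/bounded source reading the input Parquet file(s); placeholder class
DataStream<MyEvent> events = env.addSource(new ParquetProtoSource());

// Bulk-format sink using the ParquetProtoWriters factory sketched earlier
StreamingFileSink<MyAggregate> sink = StreamingFileSink
        .forBulkFormat(new Path("s3://bucket/output"), ParquetProtoWriters.forType(MyAggregate.class))
        .build();

events
        .keyBy(event -> event.getKey())
        .timeWindow(Time.minutes(5))
        .aggregate(new MyAggregateFunction())
        .addSink(sink);

env.execute("proto-to-parquet");

In a setup like this, part files only move from in-progress to finished on checkpoints, which seems to match the behavior described above when no checkpoint runs before the bounded source finishes.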
Hi Rafi,

Have you enabled checkpointing for your job?

Cheers,
Kostas

On Thu, Mar 21, 2019 at 5:18 PM Rafi Aroch <[hidden email]> wrote:
Hi Kostas,

Yes, I have.

Rafi

On Thu, Mar 21, 2019, 20:47 Kostas Kloudas <[hidden email]> wrote:
Hi Kostas,

Thank you. I'm currently testing my job against a small file, so it finishes before the checkpointing starts. But even with a larger file where checkpoints did happen, there would always be trailing events arriving after the last checkpoint and before the source finishes. Would these events be lost? In that case, any flow which is (bounded stream) => (StreamingFileSink) would not give the expected results...

The other alternative would be using BucketingSink, but it does not guarantee exactly-once into S3, which is not preferable.

Can you suggest any workaround? Somehow making sure checkpointing is triggered at the end?

Rafi

On Thu, Mar 21, 2019 at 9:40 PM Kostas Kloudas <[hidden email]> wrote:
Hi Rafi,
There is also an ongoing effort to support bounded streams in the DataStream API [1], which might provide the backbone for the functionality that you need.

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-11875
Thanks Piotr & Kostas. Really looking forward to this :)

Rafi

On Wed, Mar 27, 2019 at 10:58 AM Piotr Nowojski <[hidden email]> wrote: