Streaming File Sink - Parquet File Writer

Vinay Patil
Hi,

I am not able to roll files based on file size, because the bulk format builder only uses OnCheckpointRollingPolicy.

One way is to write a custom StreamingFileSink and provide a RollingPolicy, as RowFormatBuilder does. Is this the correct way to go?

Another way is to write a ParquetEncoder and use it with RowFormatBuilder.

P.S. I am curious to know why the RollingPolicy was not exposed for BulkFormat.

Regards,
Vinay Patil

Re: Streaming File Sink - Parquet File Writer

Kostas Kloudas-2
Hi Vinay,

You are correct that the bulk formats only support OnCheckpointRollingPolicy.
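
To make the difference concrete, here is a minimal sketch of the two rolling behaviours. It is a toy model only: ToyRollingPolicy, ToySizePolicy, ToyOnCheckpointPolicy, ToySink and RollingDemo are hypothetical names made up for this illustration and are not Flink classes, and the record sizes and checkpoint interval are arbitrary assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT Flink's classes.
interface ToyRollingPolicy {
    boolean rollOnEvent(long partSizeBytes); // asked after every record
    boolean rollOnCheckpoint();              // asked at every checkpoint
}

// Rolls as soon as the in-progress part reaches maxBytes.
final class ToySizePolicy implements ToyRollingPolicy {
    private final long maxBytes;
    ToySizePolicy(long maxBytes) { this.maxBytes = maxBytes; }
    public boolean rollOnEvent(long partSizeBytes) { return partSizeBytes >= maxBytes; }
    public boolean rollOnCheckpoint() { return false; }
}

// Ignores size and rolls only when a checkpoint happens.
final class ToyOnCheckpointPolicy implements ToyRollingPolicy {
    public boolean rollOnEvent(long partSizeBytes) { return false; }
    public boolean rollOnCheckpoint() { return true; }
}

final class ToySink {
    private final ToyRollingPolicy policy;
    private final List<Long> finishedPartSizes = new ArrayList<>();
    private long currentSize = 0;

    ToySink(ToyRollingPolicy policy) { this.policy = policy; }

    void write(long recordBytes) {
        currentSize += recordBytes;
        if (policy.rollOnEvent(currentSize)) roll();
    }

    void checkpoint() {
        if (currentSize > 0 && policy.rollOnCheckpoint()) roll();
    }

    // Closes the in-progress part and starts a new, empty one.
    private void roll() {
        finishedPartSizes.add(currentSize);
        currentSize = 0;
    }

    List<Long> finishedPartSizes() { return finishedPartSizes; }
}

final class RollingDemo {
    // Writes ten 40-byte records and checkpoints after every fifth record.
    static List<Long> run(ToyRollingPolicy policy) {
        ToySink sink = new ToySink(policy);
        for (int i = 1; i <= 10; i++) {
            sink.write(40);
            if (i % 5 == 0) sink.checkpoint();
        }
        return sink.finishedPartSizes();
    }

    public static void main(String[] args) {
        System.out.println("size-based:    " + run(new ToySizePolicy(100)));
        System.out.println("on-checkpoint: " + run(new ToyOnCheckpointPolicy()));
    }
}
```

In this toy run the size-based policy finishes three parts of 120 bytes each, while the on-checkpoint policy finishes two parts of 200 bytes each, so with the latter the part size is determined by how much data arrives between checkpoints rather than by a size threshold.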

The reason is that Flink currently relies on the Hadoop writer for Parquet.

Bulk formats keep important details about how they write the actual data (such as compression schemes, offsets, etc.) in metadata, and they write this metadata with the file (e.g. Parquet writes it as a footer). The Hadoop writer gives no access to this metadata. Given this, there is no way for Flink to checkpoint a part file safely without closing it.
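
A small sketch may help show why the footer matters. Suppose a sink tried to resume a part file by recording a byte offset at checkpoint time and truncating the file to that offset on recovery. The toy format below is not Parquet and the class names (ToyFooterWriter, ToyFooterReader, FooterDemo) are hypothetical; the layout (record bytes, then a table of record offsets, then a record count) is an assumption chosen only to mimic a writer that keeps its metadata in memory and emits it on close.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy footer-based format, NOT Parquet.
// Layout: [record bytes ...][int offset per record][int recordCount]
final class ToyFooterWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Metadata lives only in memory until close(), as with a footer-based writer.
    private final List<Integer> offsets = new ArrayList<>();

    void write(String record) {
        offsets.add(out.size());
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    int bytesWrittenSoFar() { return out.size(); }

    // Appends the footer and returns the finished file contents.
    byte[] close() {
        ByteBuffer footer = ByteBuffer.allocate(4 * offsets.size() + 4);
        for (int offset : offsets) footer.putInt(offset);
        footer.putInt(offsets.size());
        byte[] footerBytes = footer.array();
        out.write(footerBytes, 0, footerBytes.length);
        return out.toByteArray();
    }
}

final class ToyFooterReader {
    static List<String> read(byte[] file) {
        if (file.length < 4) throw new IllegalStateException("footer missing");
        ByteBuffer buf = ByteBuffer.wrap(file);
        int count = buf.getInt(file.length - 4);
        long footerStart = file.length - 4L - 4L * count;
        if (count < 0 || footerStart < 0) {
            throw new IllegalStateException("footer missing or corrupt");
        }
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int start = buf.getInt((int) footerStart + 4 * i);
            // A record ends where the next one starts, or where the footer begins.
            int end = (i + 1 < count)
                    ? buf.getInt((int) footerStart + 4 * (i + 1))
                    : (int) footerStart;
            records.add(new String(file, start, end - start, StandardCharsets.UTF_8));
        }
        return records;
    }
}

final class FooterDemo {
    static boolean readable(byte[] file) {
        try {
            ToyFooterReader.read(file);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToyFooterWriter writer = new ToyFooterWriter();
        writer.write("alpha");
        writer.write("beta");
        int checkpointedOffset = writer.bytesWrittenSoFar(); // offset a checkpoint could record
        byte[] closed = writer.close();
        byte[] truncated = Arrays.copyOf(closed, checkpointedOffset); // "resume" by truncation

        System.out.println("closed file readable:    " + readable(closed));
        System.out.println("truncated file readable: " + readable(truncated));
    }
}
```

Here the closed file reads back as the two records, while the file truncated to the checkpointed offset is rejected because the offset table existed only in the writer's memory. A writer that exposed this metadata could let it be stored in the checkpoint, which is the access that is missing in the situation described above.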

The solution would be to write our own writer and not go through the Hadoop one, but there are no concrete plans for this, as far as I know.

Cheers,
Kostas


On Tue, Oct 29, 2019 at 12:57 PM Vinay Patil <[hidden email]> wrote:

>
> Hi,
>
> I am not able to roll the files based on file size as the bulkFormat has onCheckpointRollingPolicy.
>
> One way is to write CustomStreamingFileSink and provide RollingPolicy like RowFormatBuilder. Is this the correct way to go ahead ?
>
> Another way is to write ParquetEncoder and use RowFormatBuilder.
>
> P.S. Curious to know Why was the RollingPolicy not exposed in case of BulkFormat ?
>
> Regards,
> Vinay Patil