Customize Part file naming (Flink 1.9.0)

amran dean
Hello,
In Flink 1.9.0, StreamingFileSink's part file naming is not configurable. Files are always named part-<integer>-<integer> (the subtask index followed by a part counter), with no extension.

My use case for StreamingFileSink is a Kafka -> S3 pipeline, where the files are later read and processed from S3 by Spark. In almost all cases, I want to compress the raw data before writing it to S3, using the bulk format (forBulkFormat).
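
To make the compression step concrete, here is a minimal sketch of a gzip line writer. It assumes newline-delimited string records, and GzipLineWriter is a hypothetical name: the class only mirrors the addElement/flush/finish shape of a bulk writer and uses nothing but the JDK, so it is not wired into Flink as written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical writer (not a Flink class): gzip-compresses one record per line. */
final class GzipLineWriter {
    private final GZIPOutputStream gzip;

    GzipLineWriter(OutputStream target) {
        try {
            // syncFlush = true so that flush() pushes compressed bytes to the target.
            this.gzip = new GZIPOutputStream(target, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one record followed by a newline. */
    void addElement(String record) {
        try {
            gzip.write(record.getBytes(StandardCharsets.UTF_8));
            gzip.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Flushes buffered compressed data to the target stream. */
    void flush() {
        try {
            gzip.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the gzip trailer without closing the target stream. */
    void finish() {
        try {
            gzip.finish();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for checking the output: decompresses gzip bytes back to text. */
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The bytes this produces are a valid gzip stream, but nothing in the data stream tells a downstream reader which file name to expect, which is why the part file's extension matters.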

Spark relies on filename extensions to infer compression, so with the current naming scheme it treats the compressed files as uncompressed and reads gibberish. I see that 1.10 adds the ability to customize the part file prefix and suffix, but I need a solution as soon as possible. Can this be backported to 1.9, or are there alternatives?
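
The interaction between the naming scheme and extension-based inference can be sketched as follows. PartFileNaming and both of its methods are hypothetical names used only for illustration, and the extension table is a simplified stand-in for the lookup Spark performs, not its actual implementation.

```java
import java.util.Optional;

/** Hypothetical helper (not part of Flink or Spark). */
final class PartFileNaming {

    /** Builds a name of the form prefix-subtaskIndex-partCounter plus an optional suffix. */
    static String partFileName(String prefix, int subtaskIndex, long partCounter, String suffix) {
        return prefix + "-" + subtaskIndex + "-" + partCounter + suffix;
    }

    /** Simplified extension-based codec lookup; the mapping is illustrative only. */
    static Optional<String> codecFor(String fileName) {
        if (fileName.endsWith(".gz")) {
            return Optional.of("gzip");
        }
        if (fileName.endsWith(".bz2")) {
            return Optional.of("bzip2");
        }
        if (fileName.endsWith(".snappy")) {
            return Optional.of("snappy");
        }
        // No recognized extension: a reader has no hint that the file is compressed.
        return Optional.empty();
    }
}
```

With an empty suffix, partFileName("part", 0, 3, "") gives part-0-3 and the lookup finds no codec. With a ".gz" suffix the same file becomes part-0-3.gz and resolves to gzip, which is the effect a configurable suffix is meant to achieve.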


Re: Customize Part file naming (Flink 1.9.0)

Ravi Bhushan Ratnakar
Hi,

As an alternative, you can use BucketingSink, which lets you customize the part file prefix and suffix.

(In reply to amran dean's message of Sat, Oct 19, 2019, 3:54 AM.)


Re: Customize Part file naming (Flink 1.9.0)

taher koitawala-2
Be careful with BucketingSink: it does not provide exactly-once semantics, and it has issues with S3 consistency.



(In reply to Ravi Bhushan Ratnakar's message of Sat, Oct 19, 2019, 1:42 PM.)