Use case for StreamingFileSink: Different parquet writers within the Sink

Kailash Dayanand
We have the following use case: we are reading a stream of events that we want to write to different parquet files based on data within the element <IN>. The end goal is to register these parquet files in Hive so they can be queried. I was exploring the option of using StreamingFileSink for this use case but found a few things which I could not customize.

It looks like StreamingFileSink takes a single schema, i.e. it provides a ParquetWriter for one specific schema. Since the elements need different Avro schemas depending on the data they carry, I could not use the sink as-is (AvroParquetWriters needs the same Schema to be specified for the parquetBuilder). Looking a bit deeper, I found that there is a WriterFactory here: https://tinyurl.com/y68drj35 . This could be extended to create a BulkPartWriter based on the BucketID, with something like this: BulkWriter.Factory<IN, BucketID> writerFactory. That way you could create a unique ParquetWriter for each bucket. Is there any other option to do this?

I also considered using one common schema in place of all the different schemas, but I have not fully explored that option.

Thanks
Kailash

Re: Use case for StreamingFileSink: Different parquet writers within the Sink

Kailash Dayanand
Hello, 

I was able to solve this by creating a data model in which all incoming events are wrapped in a message envelope, and writing a Sink for a DataStream of these message envelopes. I also ended up creating the parquet writers not at construction time but inside the addElement method of the BulkWriter. Adding this information for posterity in case someone needs to handle the same use case.
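
For anyone who wants to see the shape of this, here is a minimal, dependency-free sketch of the lazy-writer idea. It is not the actual code from my job. The BulkWriter interface below is a local stand-in that only mirrors the addElement/flush/finish methods of Flink's BulkWriter, and Envelope, RecordingWriter and LazyEnvelopeWriter are hypothetical names made up for illustration. RecordingWriter merely collects rows in memory where a real per-schema ParquetWriter would go. The sketch also assumes that the bucket assigner routes each schema to its own bucket, so a single part file only ever sees one schema.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class EnvelopeSinkSketch {

    /** Local stand-in mirroring the methods of Flink's BulkWriter<T>. */
    interface BulkWriter<T> {
        void addElement(T element) throws IOException;
        void flush() throws IOException;
        void finish() throws IOException;
    }

    /** Hypothetical envelope: the schema name travels with the wrapped event. */
    static final class Envelope {
        final String schemaName;
        final String payload;

        Envelope(String schemaName, String payload) {
            this.schemaName = schemaName;
            this.payload = payload;
        }
    }

    /** Hypothetical placeholder for a per-schema ParquetWriter; keeps rows in memory. */
    static final class RecordingWriter {
        final String schemaName;
        final List<String> rows = new ArrayList<>();
        boolean closed;

        RecordingWriter(String schemaName) {
            this.schemaName = schemaName;
        }

        void write(String row) {
            rows.add(row);
        }

        void close() {
            closed = true;
        }
    }

    /** Defers creation of the underlying writer until the first element arrives. */
    static final class LazyEnvelopeWriter implements BulkWriter<Envelope> {
        private final Function<String, RecordingWriter> writerFactory;
        private RecordingWriter delegate; // stays null until addElement is first called

        LazyEnvelopeWriter(Function<String, RecordingWriter> writerFactory) {
            this.writerFactory = writerFactory;
        }

        @Override
        public void addElement(Envelope element) {
            if (delegate == null) {
                // The schema is only known once the first envelope is seen.
                delegate = writerFactory.apply(element.schemaName);
            } else if (!delegate.schemaName.equals(element.schemaName)) {
                // Assumption: bucketing keeps schemas apart, so this signals a routing bug.
                throw new IllegalStateException(
                        "Part file for schema " + delegate.schemaName
                                + " received an element with schema " + element.schemaName);
            }
            delegate.write(element.payload);
        }

        @Override
        public void flush() {
            // Nothing is buffered in this sketch.
        }

        @Override
        public void finish() {
            if (delegate != null) {
                delegate.close();
            }
        }

        RecordingWriter delegate() {
            return delegate;
        }
    }

    public static void main(String[] args) {
        LazyEnvelopeWriter writer = new LazyEnvelopeWriter(RecordingWriter::new);
        writer.addElement(new Envelope("click", "row-1"));
        writer.addElement(new Envelope("click", "row-2"));
        writer.finish();
        System.out.println(writer.delegate().schemaName + ": " + writer.delegate().rows);
    }
}
```

In a real job the factory function would presumably build a Parquet writer for the envelope's Avro schema on top of the part file's output stream; that wiring is left out here because it depends on the Flink and Parquet versions in use.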

Thanks
Kailash 

On Sun, Apr 28, 2019 at 3:57 PM Kailash Dayanand wrote:
[original message, quoted in full above]