partition columns with StreamingFileSink


Yitzchak Lieberman
Hi.

I'm using StreamingFileSink to write partitioned data to S3.
The code is below:
StreamingFileSink<GenericRecord> sink = StreamingFileSink
        .forBulkFormat(new Path("s3a://test-bucket/test"),
                ParquetAvroFactory.getParquetWriter(schema, "GZIP"))
        .withBucketAssigner(new PartitionBucketAssigner(partitionColumns))
        .build();
How can I remove the partition columns from the data (or avoid populating them in the GenericRecord)?
My problem is with the AWS Glue crawler, which creates duplicate columns in the table.

Thanks,
Yitzchak.