Verifying correctness of StreamingFileSink (Kafka -> S3)

amran dean
I am evaluating StreamingFileSink (Kafka 0.10.11) as a production-ready alternative to a current Kafka -> S3 solution.

Is there any way to verify the integrity of data written to S3? I'm unclear on how the file names (e.g. part-1-17) map to Kafka partitions, and further unsure how to ensure that no Kafka records are lost (I know Flink guarantees exactly-once, but this is more of a sanity check).

Re: Verifying correctness of StreamingFileSink (Kafka -> S3)

Kostas Kloudas
Hi Amran,

If you want to know which partition your input data comes from,
you can always have a separate bucket for each partition.
As described in [1], you can extract the offset/partition/topic
information for each incoming record and, based on this, decide the
appropriate bucket to put the record in.

Cheers,
Kostas

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
