Hi,

I have an S3 bucket that is continuously written to by millions of devices, each uploading small compressed archives. I want to treat the tar-gzipped (.tgz) files as a streaming source and process each archive; every archive contains three files, each of which might need to be processed. I see that env.readFile(f, bucket, FileProcessingMode.PROCESS_CONTINUOUSLY, 10000L).print(); might do what I need, but I am unsure how best to implement 'f', the FileInputFormat. Is there a similar example I could reference, or is this idea not workable with this method? I also need to ensure exactly-once processing and to trigger removal of the files after they have been processed.

Thanks
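P.S. For illustration, the kind of FileInputFormat I have in mind is sketched below. This is untested, assumes Apache Commons Compress is on the classpath, and the class name is made up; the idea is to treat each .tgz object as one unsplittable split and emit one (entry name, entry bytes) record per file inside the archive.

import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileInputSplit;

// Illustrative name: emits (entry name, entry bytes) for each file inside a .tgz archive.
public class TgzEntryInputFormat extends FileInputFormat<Tuple2<String, byte[]>> {

    private transient TarArchiveInputStream tar;
    private transient TarArchiveEntry currentEntry;

    public TgzEntryInputFormat() {
        // A gzipped tar cannot be split, so each archive is read as a single split.
        this.unsplittable = true;
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split); // opens this.stream on the underlying object
        tar = new TarArchiveInputStream(new GzipCompressorInputStream(stream));
        currentEntry = tar.getNextTarEntry();
    }

    @Override
    public boolean reachedEnd() {
        return currentEntry == null;
    }

    @Override
    public Tuple2<String, byte[]> nextRecord(Tuple2<String, byte[]> reuse) throws IOException {
        // Reading from the tar stream stops at the end of the current entry.
        byte[] content = IOUtils.toByteArray(tar);
        Tuple2<String, byte[]> record = Tuple2.of(currentEntry.getName(), content);
        currentEntry = tar.getNextTarEntry();
        return record;
    }

    @Override
    public void close() throws IOException {
        if (tar != null) {
            tar.close();
        }
        super.close();
    }
}

which would then be used along the lines of

DataStream<Tuple2<String, byte[]>> entries = env.readFile(
        new TgzEntryInputFormat(), "s3://my-bucket/uploads/",  // path is a placeholder
        FileProcessingMode.PROCESS_CONTINUOUSLY, 10000L);

My (possibly wrong) understanding is that the continuous monitoring source checkpoints its read position, so files should not be re-read after a restore, but that there is no built-in deletion of processed files, so I would need a separate cleanup step such as an S3 lifecycle rule.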
Why don’t you get an S3 notification on SQS and perform the actions from there? You will probably need to write the content of the files to a NoSQL database. Alternatively, send the S3 notification to Kafka and have Flink read from there (rough sketch below).

On 01.09.2020 at 16:46, orionemail <[hidden email]> wrote:
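Rough sketch of the Kafka variant mentioned above. The topic name, bootstrap servers, group id, and JSON field paths are assumptions; the exact payload depends on how the notification is forwarded (a direct S3 event looks like the structure accessed here, while an SNS-wrapped one nests it inside a "Message" field).

import java.io.InputStream;
import java.util.Properties;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.util.IOUtils;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class S3NotificationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60000); // Kafka offsets become part of the checkpointed state

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // assumption
        props.setProperty("group.id", "s3-archive-processor");  // assumption

        env.addSource(new FlinkKafkaConsumer<>("s3-events", new SimpleStringSchema(), props))
           .map(new FetchArchive())
           .print(); // replace with the actual archive/record processing

        env.execute("process-s3-archives");
    }

    // Parses the S3 event notification JSON and downloads the referenced object.
    static class FetchArchive extends RichMapFunction<String, byte[]> {
        private transient AmazonS3 s3;
        private transient ObjectMapper mapper;

        @Override
        public void open(Configuration parameters) {
            s3 = AmazonS3ClientBuilder.defaultClient();
            mapper = new ObjectMapper();
        }

        @Override
        public byte[] map(String eventJson) throws Exception {
            JsonNode rec = mapper.readTree(eventJson).path("Records").get(0);
            String bucket = rec.path("s3").path("bucket").path("name").asText();
            String key = rec.path("s3").path("object").path("key").asText();
            try (InputStream in = s3.getObject(bucket, key).getObjectContent()) {
                return IOUtils.toByteArray(in); // the .tgz bytes; unpack downstream
            }
        }
    }
}

With checkpointing enabled the Kafka source's offsets are part of Flink's state, and each notification carries the bucket/key, which also makes it straightforward to delete the object once it has been processed.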
A word of caution: streaming from S3 is really cost-prohibitive, as the only way to detect new files is to continuously spam the S3 List API.

On Tue, Sep 1, 2020 at 4:50 PM Jörn Franke <[hidden email]> wrote:
OK, thanks for the heads-up on the cost point, I will check the cost calculations. The bucket already has SNS enabled for another solution to this problem, but I'm trying to use the minimal number of different software components at this stage of the pipeline. My preferred approach would have been for the devices to send this data directly to a Kinesis/Kafka stream, but that is not an option at this time.

Thanks for the assistance.

On Tuesday, 1 September 2020 17:53, Ayush Verma <[hidden email]> wrote: