Hi guys,
What is the best way to process a file from a Unix file system, given that there is no guarantee as to which task manager will be assigned to process it? We run Flink in standalone mode. We currently follow a brute-force approach in which we copy the file to every task manager. Is there a better way to do this? Best, Nick.
Hi Nick, On a project I worked on, we simply made the file accessible on a shared NFS drive. Our source was custom, and we forced it to parallelism 1 inside the job so the file wouldn't be read multiple times. The rest of the job was distributed. This was also on a standalone cluster. On a resource-managed cluster, I guess the resource manager could take care of copying the file for us. Hope this helps. If there's a better solution, I'd also be happy to hear it :). Regards, Laurent. On Tue, Jun 23, 2020, 20:51 Nick Bendtner <[hidden email]> wrote:
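The approach Laurent describes might look roughly like the sketch below: the file lives on a path every TaskManager can mount, and the source operator is pinned to parallelism 1 so only one subtask reads it (the path and job name here are hypothetical, and this assumes the DataStream API circa Flink 1.10/1.11):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedFileJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Path on an NFS mount visible to every TaskManager (hypothetical path).
        DataStream<String> lines = env
                .readTextFile("file:///mnt/shared/input.txt")
                .setParallelism(1); // single reader, so the file is read exactly once

        // Downstream operators still run at the job's full parallelism.
        lines.map((MapFunction<String, String>) String::toUpperCase)
             .print();

        env.execute("read shared file");
    }
}
```

Note that only the source needs parallelism 1; Flink will redistribute the records to downstream operators automatically.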
Thanks that makes sense. On Tue, Jun 23, 2020 at 2:13 PM Laurent Exsteens <[hidden email]> wrote:
Another option if the file is small enough is to load it in the driver and directly initialize an in-memory source (env.fromElements). On Tue, Jun 23, 2020 at 9:57 PM Vishwas Siravara <[hidden email]> wrote:
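For the small-file option, a minimal sketch might look like the following: the client (driver) reads the file before submitting the job, and the contents are shipped inside the job graph, so no TaskManager needs filesystem access to the original file. The path is hypothetical, and `fromCollection` is used here as the collection-typed sibling of `env.fromElements`:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class InMemoryFileSource {
    public static void main(String[] args) throws Exception {
        // Read the whole file on the client side, before job submission
        // ("/mnt/shared/input.txt" is a hypothetical path).
        List<String> lines = Files.readAllLines(Paths.get("/mnt/shared/input.txt"));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The collection is serialized into the job graph, so every
        // TaskManager receives the data without reading the file itself.
        env.fromCollection(lines)
           .print();

        env.execute("in-memory file source");
    }
}
```

The caveat is that the data travels with the submitted job, so this only makes sense for genuinely small files; large payloads can exceed the framework's RPC message size limits.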
-- Arvid Heise | Senior Java Developer, Ververica GmbH