Non parallel file sources

Non parallel file sources

Nick Bendtner
Hi guys,
What is the best way to process a file from a Unix file system, given that there is no guarantee as to which task manager will be assigned to process it? We run Flink in standalone mode. We currently use the brute-force approach of copying the file to every task manager. Is there a better way to do this?


Best,
Nick. 
Re: Non parallel file sources

Laurent Exsteens
Hi Nick,

On a project I worked on, we simply made the file accessible on a shared NFS drive.
Our source was custom, and we forced it to parallelism 1 inside the job, so the file wouldn't be read multiple times. The rest of the job was distributed.
This was also on a standalone cluster. On a resource-managed cluster, I guess the resource manager could take care of copying the file for us.
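
As a rough illustration of the approach described above, here is a minimal sketch of the read loop that a parallelism-1 custom source might run against a file on the shared mount. The class name `SharedFileLines`, the source class `MyFileSource` and the path `/mnt/shared/input.txt` are hypothetical, and the Flink wiring shown in the trailing comment assumes the legacy `SourceFunction` API of Flink 1.x; it is not the code from the original project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

/** Hypothetical helper: streams a file line by line to an emitter callback. */
final class SharedFileLines {

    private SharedFileLines() {}

    /** Passes every line from the reader to the emitter and returns the line count. */
    static long emitLines(Reader reader, Consumer<String> emitter) {
        long count = 0;
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                emitter.accept(line); // one record per line, never the whole file in memory
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    /** Opens a file (for example on an NFS mount) as UTF-8 and streams its lines. */
    static long emitLines(Path sharedFile, Consumer<String> emitter) {
        try {
            return emitLines(Files.newBufferedReader(sharedFile, StandardCharsets.UTF_8), emitter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Possible Flink wiring (assumption, not compiled here; needs Flink 1.x on the classpath):
    //   inside MyFileSource implements SourceFunction<String>, in run(SourceContext<String> ctx):
    //       SharedFileLines.emitLines(Paths.get("/mnt/shared/input.txt"), ctx::collect);
    //   in the job:
    //       env.addSource(new MyFileSource()).setParallelism(1);
}
```

Because the single source instance may be scheduled on any task manager, the path has to resolve to the same file on every machine, which is what the shared NFS mount provides.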

Hope this helps. If there is a better solution, I'd be happy to hear about it too :).

Regards,

Laurent.


 Be green, keep it on the screen
Re: Non parallel file sources

Vishwas Siravara
Thanks, that makes sense.

Re: Non parallel file sources

Arvid Heise-3
Another option, if the file is small enough, is to load it in the driver program and directly initialize an in-memory source from its contents (env.fromElements).
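
A minimal sketch of the driver-side half of this idea, assuming the file is line-oriented text and fits comfortably in memory. The class name `DriverSideFile` and the path in the comment are hypothetical. The hand-off to Flink is shown only as a comment and assumes the Flink 1.x DataStream API, where `env.fromCollection` accepts a collection in the same way `env.fromElements` accepts varargs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: reads a small file fully into memory on the client side. */
final class DriverSideFile {

    private DriverSideFile() {}

    /** Collects all lines from the reader into a list. */
    static List<String> loadLines(Reader reader) {
        try (BufferedReader in = new BufferedReader(reader)) {
            return in.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a file that is local to the machine submitting the job (UTF-8). */
    static List<String> loadLines(Path localFile) {
        try {
            return Files.readAllLines(localFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Possible hand-off in the job's main method (assumption, not compiled here):
    //   List<String> lines = DriverSideFile.loadLines(Paths.get("/data/lookup.csv"));
    //   DataStream<String> stream = env.fromCollection(lines);
}
```

With this approach the file only needs to exist where the job is submitted, since the records are shipped along with the job itself. That is also why it suits small files only.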



--

Arvid Heise | Senior Java Developer

