Re: Distribute DataSet to subset of nodes
Posted by Fabian Hueske-2
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Distribute-DataSet-to-subset-of-nodes-tp2814p2843.html
Hi Stefan,
the problem is that you cannot directly influence the scheduling of tasks to nodes, so you cannot ensure that a task reads the data you put in the local file systems of specific nodes. HDFS, in contrast, provides a shared file system, which means every node can read all data regardless of where in the cluster it is stored.
I assumed the data is small enough to broadcast because you want to keep it in memory.
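To make the memory trade-off concrete, here is a minimal, Flink-free sketch (the function names `broadcast` and `rebalance` are illustrative, not Flink API): with a broadcast, every parallel worker holds a full copy of the data set, so total memory grows linearly with the parallelism, while a partitioned distribution gives each worker only a slice.

```python
def broadcast(records, parallelism):
    """Every worker receives a full copy of the records (broadcast)."""
    return [list(records) for _ in range(parallelism)]

def rebalance(records, parallelism):
    """Records are dealt round-robin across workers (partitioned)."""
    workers = [[] for _ in range(parallelism)]
    for i, record in enumerate(records):
        workers[i % parallelism].append(record)
    return workers

data = list(range(100))

# Broadcast to 10 workers: 10 full copies, 1000 records held in total.
assert sum(len(w) for w in broadcast(data, 10)) == 1000

# Rebalanced across 10 workers: still 100 records, 10 per worker.
assert sum(len(w) for w in rebalance(data, 10)) == 100
```

This is why broadcasting is only attractive when the data set is small enough that each task can comfortably keep a full copy in memory.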
Regarding your question: it is not guaranteed that two different tasks, each with a parallelism of 5, will be distributed across all 10 nodes (even if you have only 10 processing slots).
What would work is a single map task with parallelism 10 on a Flink setup with 10 TaskManagers on 10 machines and only one processing slot per TaskManager. However, you won't be able to replicate the data to two different sets of map tasks, because you cannot know which task instance will be executed on which machine (the tasks of the two task sets are indistinguishable to the scheduler).
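As a rough sketch, the one-slot-per-TaskManager setup described above would correspond to a flink-conf.yaml fragment like the following on each of the 10 machines (key names as in Flink's standard configuration):

```yaml
# One processing slot per TaskManager, so a parallelism-10 task
# occupies exactly one slot on each of the 10 machines.
taskmanager.numberOfTaskSlots: 1

# Default parallelism matching the total number of slots.
parallelism.default: 10
```

With this setup the parallelism-10 map task fills all available slots, so each machine runs exactly one of its task instances.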
As I said, reading from the local file system in a cluster and forcing task scheduling onto specific nodes is quite tricky.
Cheers, Fabian