Re: Distribute DataSet to subset of nodes

Posted by Fabian Hueske-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Distribute-DataSet-to-subset-of-nodes-tp2814p2832.html

Hi Stefan,

forcing the scheduling of tasks to certain nodes and reading files from the local file system in a multi-node setup is actually quite tricky and requires some understanding of the internals.
It is possible and I can help you with that, but I would recommend using a shared filesystem such as HDFS if that is an option.

Best, Fabian

2015-09-14 19:16 GMT+02:00 Stefan Bunk <[hidden email]>:
Hi,

actually, I am distributing my data before the program starts, without using broadcast sets.

However, the approach should still work, under one condition:
DataSet mapped1 = data.flatMap(yourMap).withBroadcastSet(smallData1, "data").setParallelism(5);
DataSet mapped2 = data.flatMap(yourMap).withBroadcastSet(smallData2, "data").setParallelism(5);
Is it guaranteed that this selects disjoint sets of nodes, i.e., five nodes for mapped1 and five other nodes for mapped2?

Is there any way to select the five nodes explicitly? Currently, I have stored the first half of the data on nodes 1-5 and the second half on nodes 6-10. With this approach, I guess the nodes are selected randomly, so I would have to copy both halves to all of the nodes.

Best,
Stefan