Re: Distribute DataSet to subset of nodes

Posted by Sachin Goel on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Distribute-DataSet-to-subset-of-nodes-tp2814p2817.html

Hi Stefan
Just a clarification: the output for an element computed against the whole data will be the union of the outputs computed against the two halves. Is this what you're trying to achieve? [It appears so, since every flatMap task will independently produce outputs.]

In that case, one solution could be to simply run two flatMap operations, each based on one part of the *broadcast* data set, and take the union of their outputs.
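
A rough sketch of that idea in plain Java, with no Flink classes involved, just to show why the union of the two per-half results matches a single pass over the whole data. The class name, the helper methods and the prefix-matching logic are all made up for illustration; in Flink this would presumably correspond to two flatMap calls on the same DataSet, each loading one half in its open method, followed by union.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the "two flatMaps plus union" idea; no Flink API is used here.
class SplitFlatMapSketch {

    // Hypothetical per-element logic: emit one output for every reference
    // entry that starts with the element.
    static List<String> flatMapAgainst(List<String> referencePart, String element) {
        List<String> out = new ArrayList<>();
        for (String ref : referencePart) {
            if (ref.startsWith(element)) {
                out.add(element + "->" + ref);
            }
        }
        return out;
    }

    // Runs the same logic once against each half and concatenates the two
    // result lists, which plays the role of the union.
    static List<String> flatMapBothHalves(List<String> firstHalf,
                                          List<String> secondHalf,
                                          List<String> input) {
        List<String> out = new ArrayList<>();
        for (String element : input) {
            out.addAll(flatMapAgainst(firstHalf, element));
        }
        for (String element : input) {
            out.addAll(flatMapAgainst(secondHalf, element));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> firstHalf = List.of("apple", "avocado");
        List<String> secondHalf = List.of("banana", "apricot");
        List<String> input = List.of("a", "b");
        System.out.println(flatMapBothHalves(firstHalf, secondHalf, input));
    }
}
```

Each input element is processed once per half, so every task only ever needs one half of the data in memory.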

Cheers
Sachin

On Sep 13, 2015 7:04 PM, "Stefan Bunk" <[hidden email]> wrote:
Hi!

I have the following problem: I have 10 nodes on which I want to execute a flatMap operator on a DataSet. In the open method of the operator, some data that the operator needs is read from disk and preprocessed. The problem is that the data does not fit in memory on one node; half of the data does, however.
So on five of the ten nodes I stored one half of the data to be read in the open method, and on the other five nodes the other half.

Now my question: How can I distribute my DataSet so that each element is sent once to a node with the first half of my data and once to a node with the other half?

I looked at implementing a custom partitioner; however, my problems were:
(i) I have no mapping from the partition index I am supposed to return to the nodes, and therefore to the data stored on them. How do I know that index 5 contains one half of the data and index 6 the other half?
(ii) I do not know the current index. Obviously, I want to send my DataSet element only once over the network.

Best,
Stefan