Currently i'm trying to implement this algorithm [1] which requires me to loop over one DataSet (the edges) and access another DataSet (the vertices), for this loop i use a Mapping (i'm not sure if this is the correct way of looping over a DataSet) but i don't know how to access the elements of another DataSet while i'm looping over one.
I know Gelly also has iterative support for these kind of things, but they loop over the Vertices and not the Edges [1] http://prntscr.com/d0qeyd |
Hello,
In Flink, one often used way to access data from multiple DataSets at the same time is to perform a join (Flink actually calls equi-joins [1] just "join"), just as in the database world. For example, in the algorithm that you linked, you access A[u] for every edge (u,v). I assume that you have stored A in a DataSet of (index, value) pairs. You can achieve this access pattern by performing a join, and in the join condition you specify that the first endpoint of the edge should be equal to the index of A. This way, you get a DataSet where every record contains an edge (u,v) and also A[u], so you can do a map on this where the UDF of your map will get (u,v) and A[u]. Your algorithm also accesses A[v], which can be achieved by performing a second join that is similar to the first (using the result of the first). However, the updating of P will be more tricky to translate to Flink. I'm not sure I undersand the linked algorithm correctly: does every element of P contain a list, and the + means appending an element to a list? (in the line P[v] = P[u] + v) Best, Gábor [1] https://en.wikipedia.org/wiki/Join_(SQL)#Equi-join 2016-10-30 8:25 GMT+01:00 otherwise777 <[hidden email]>: > Currently i'm trying to implement this algorithm [1] which requires me to > loop over one DataSet (the edges) and access another DataSet (the vertices), > for this loop i use a Mapping (i'm not sure if this is the correct way of > looping over a DataSet) but i don't know how to access the elements of > another DataSet while i'm looping over one. > > I know Gelly also has iterative support for these kind of things, but they > loop over the Vertices and not the Edges > > [1] http://prntscr.com/d0qeyd > > > > -- > View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Looping-over-a-DataSet-and-accesing-another-DataSet-tp9778.html > Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com. |
This post was updated on .
Edit: sorry i didn't read your reply correctly, i'm currectly trying some stuff out
|
The DataSet API only supports binary joins but one can simulate an n-ary join by chaining successive join operations. Your algorithm requires a global ordering on edges, requiring a parallelism of 1, and will not scale in a distributed processing system. Flink excels at processing bulk (larger than memory) data in serial.On Mon, Oct 31, 2016 at 5:54 AM, otherwise777 <[hidden email]> wrote: Thank you for your reply and explanation, I think there is one issue with |
Thank you for your reply, this is new information for me,
Regarding the algorithm, i gave it a better look and i don't think it will work with joining. When looping over the Edge set (u,v) we need to be able to write and read A[u] and A[v]. If i join them it will create a new instances of that value and it doesn't matter if it's changed in one instance. For example i have the following edges: u v 1 2 1 3 With vertices and values: 1 a 2 b 3 c If i join them i get: u v u' v' 1 2 a b 1 3 a c If i loop over the joined set and change the u' value of the first instance to "d" then in my next loop step it will be 'a'. |
By 'loop' do you refer to an iteration? The output of a bulk iteration is processed as the input of the following iteration. Values updated in an iteration are available in the next iteration just as values updated by an operator are available to the following operator. Your chosen algorithm may not be a good fit for distributed processing frameworks like Flink, Spark, and Hadoop. You may need to recast your problem into an appropriate, scalable algorithm. Both the Gelly and Machine Learning libraries have good examples of efficient, scalable algorithms (Flink's "examples" demonstrate specific functionality). On Mon, Oct 31, 2016 at 8:52 AM, otherwise777 <[hidden email]> wrote: Thank you for your reply, this is new information for me, |
I did mean the iteratino yes, I currently solved the problem by rewriting the algorithm in gelly's GathersumApply model, thnx for the tips
I had another question regarding the original message, about appending items to a list, how would I do that? Because afaik it's not possible to add a list or array in a Tuple element right? |
In reply to this post by Greg Hogan
I just found out that I am able to use arrays in tuple values, nvm about that question
|
Free forum by Nabble | Edit this page |