Looping over a DataSet and accesing another DataSet

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Looping over a DataSet and accesing another DataSet

otherwise777
Currently i'm trying to implement this algorithm [1] which requires me to loop over one DataSet (the edges) and access another DataSet (the vertices), for this loop i use a Mapping (i'm not sure if this is the correct way of looping over a DataSet) but i don't know how to access the elements of another DataSet while i'm looping over one.

I know Gelly also has iterative support for these kind of things, but they loop over the Vertices and not the Edges

[1] http://prntscr.com/d0qeyd
Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

Gábor Gévay
Hello,

In Flink, one often used way to access data from multiple DataSets at
the same time is to perform a join (Flink actually calls equi-joins
[1] just "join"), just as in the database world.

For example, in the algorithm that you linked, you access A[u] for
every edge (u,v). I assume that you have stored A in a DataSet of
(index, value) pairs. You can achieve this access pattern by
performing a join, and in the join condition you specify that the
first endpoint of the edge should be equal to the index of A. This
way, you get a DataSet where every record contains an edge (u,v) and
also A[u], so you can do a map on this where the UDF of your map will
get (u,v) and A[u].

Your algorithm also accesses A[v], which can be achieved by performing
a second join that is similar to the first (using the result of the
first).

However, the updating of P will be more tricky to translate to Flink.
I'm not sure I undersand the linked algorithm correctly: does every
element of P contain a list, and the + means appending an element to a
list? (in the line P[v] = P[u] + v)

Best,
Gábor

[1] https://en.wikipedia.org/wiki/Join_(SQL)#Equi-join



2016-10-30 8:25 GMT+01:00 otherwise777 <[hidden email]>:

> Currently i'm trying to implement this algorithm [1] which requires me to
> loop over one DataSet (the edges) and access another DataSet (the vertices),
> for this loop i use a Mapping (i'm not sure if this is the correct way of
> looping over a DataSet) but i don't know how to access the elements of
> another DataSet while i'm looping over one.
>
> I know Gelly also has iterative support for these kind of things, but they
> loop over the Vertices and not the Edges
>
> [1] http://prntscr.com/d0qeyd
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Looping-over-a-DataSet-and-accesing-another-DataSet-tp9778.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

otherwise777
This post was updated on .
Edit: sorry i didn't read your reply correctly, i'm currectly trying some stuff out
Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

Greg Hogan
The DataSet API only supports binary joins but one can simulate an n-ary join by chaining successive join operations.

Your algorithm requires a global ordering on edges, requiring a parallelism of 1, and will not scale in a distributed processing system. Flink excels at processing bulk (larger than memory) data in serial.

Greg

On Mon, Oct 31, 2016 at 5:54 AM, otherwise777 <[hidden email]> wrote:
Thank you for your reply and explanation, I think there is one issue with
your method though, you said that i should make a join with the the key
value pair A on v and  the Edge set (u,v), this would work, however i not
only need to access A[v] in one iteration but also A[u], so if i join on v
that won't be possible

Did i understand it correctly?



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Looping-over-a-DataSet-and-accesing-another-DataSet-tp9778p9782.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

otherwise777
Thank you for your reply, this is new information for me,

Regarding the algorithm, i gave it a better look and i don't think it will work with joining. When looping over the Edge set (u,v) we need to be able to write and read A[u] and A[v]. If i join them it will create a new instances of that value and it doesn't matter if it's changed in one instance.

For example i have the following edges:
 u v
 1 2
 1 3

With vertices and values:
 1 a
 2 b
 3 c

If i join them i get:
 u v u' v'
 1 2 a b
 1 3 a c

If i loop over the joined set and change the u' value of the first instance to "d" then in my next loop step it will be 'a'.
Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

Greg Hogan
By 'loop' do you refer to an iteration? The output of a bulk iteration is processed as the input of the following iteration. Values updated in an iteration are available in the next iteration just as values updated by an operator are available to the following operator.

Your chosen algorithm may not be a good fit for distributed processing frameworks like Flink, Spark, and Hadoop. You may need to recast your problem into an appropriate, scalable algorithm. Both the Gelly and Machine Learning libraries have good examples of efficient, scalable algorithms (Flink's "examples" demonstrate specific functionality).

Greg

On Mon, Oct 31, 2016 at 8:52 AM, otherwise777 <[hidden email]> wrote:
Thank you for your reply, this is new information for me,

Regarding the algorithm, i gave it a better look and i don't think it will
work with joining. When looping over the Edge set (u,v) we need to be able
to write and read A[u] and A[v]. If i join them it will create a new
instances of that value and it doesn't matter if it's changed in one
instance.

For example i have the following edges:
 u v
 1 2
 1 3

With vertices and values:
 1 a
 2 b
 3 c

If i join them i get:
 u v u' v'
 1 2 a b
 1 3 a c

If i loop over the joined set and change the u' value of the first instance
to "d" then in my next loop step it will be 'a'.




--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Looping-over-a-DataSet-and-accesing-another-DataSet-tp9778p9784.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

otherwise777
I did mean the iteratino yes, I currently solved the problem by rewriting the algorithm in gelly's GathersumApply model, thnx for the tips

I had another question regarding the original message, about appending items to a list, how would I do that? Because afaik it's not possible to add a list or array in a Tuple element right?

Reply | Threaded
Open this post in threaded view
|

Re: Looping over a DataSet and accesing another DataSet

otherwise777
In reply to this post by Greg Hogan
I just found out that I am able to use arrays in tuple values, nvm about that question