Merging sets with common elements

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Merging sets with common elements

Simone Robutti
Hello,

I'm implementing MinHash for reccomendation on Flink. I'm almost done but I need an efficient way to merge sets of similar keys together (and later join these sets of keys with more data).

The actual data structure is of the form DataSet[(Int,Set[Int])] where the left element of the tuple is an ID for the right element, that is a set of keys. I want to merge these sets together only if they share at least one element. 

I'm rather sure to have studied the efficient solution to this problem in a local environment but I don't really know how to treat it in a distributed environment. Any suggestion?

Thanks,

Simone
Reply | Threaded
Open this post in threaded view
|

Re: Merging sets with common elements

Till Rohrmann
Hi Simone,

could you elaborate a little bit on the actual operation you want to perform. Given a data set {(1, {1,2}), (2, {2,3})} what's the result of your operation? Is the result { ({1,2}, {1,2,3}) } because the 2 is contained in both sets?

Cheers,
Till

On Wed, May 25, 2016 at 10:22 AM, Simone Robutti <[hidden email]> wrote:
Hello,

I'm implementing MinHash for reccomendation on Flink. I'm almost done but I need an efficient way to merge sets of similar keys together (and later join these sets of keys with more data).

The actual data structure is of the form DataSet[(Int,Set[Int])] where the left element of the tuple is an ID for the right element, that is a set of keys. I want to merge these sets together only if they share at least one element. 

I'm rather sure to have studied the efficient solution to this problem in a local environment but I don't really know how to treat it in a distributed environment. Any suggestion?

Thanks,

Simone

Reply | Threaded
Open this post in threaded view
|

Re: Merging sets with common elements

Aljoscha Krettek
In reply to this post by Simone Robutti
Hi,
if I understood it correctly the "key" in that case would be a fuzzy/probabilistic key. I'm not sure this can be computed using either the sort-based or hash-based joinging/grouping strategies of Flink. Maybe we can find something if you elaborate.

Cheers,
Aljoscha

On Wed, 25 May 2016 at 10:22 Simone Robutti <[hidden email]> wrote:
Hello,

I'm implementing MinHash for reccomendation on Flink. I'm almost done but I need an efficient way to merge sets of similar keys together (and later join these sets of keys with more data).

The actual data structure is of the form DataSet[(Int,Set[Int])] where the left element of the tuple is an ID for the right element, that is a set of keys. I want to merge these sets together only if they share at least one element. 

I'm rather sure to have studied the efficient solution to this problem in a local environment but I don't really know how to treat it in a distributed environment. Any suggestion?

Thanks,

Simone
Reply | Threaded
Open this post in threaded view
|

Re: Merging sets with common elements

Simone Robutti
In reply to this post by Till Rohrmann
@Till:

A more meaningful example would be the following: from {{1{1,2}},{2,{2,3}},{3,{4,5},{4{1,27}}}} the result should be {1,2,3,27},{4,5} because the set #1,#2 and #4 have at least one element in common. 

If you see this as a graph where the elements of the sets are nodes and a set express a full connection of all the elements of said set , you can see the result as the connected components of the graph. 


2016-05-25 11:42 GMT+02:00 Till Rohrmann <[hidden email]>:
Hi Simone,

could you elaborate a little bit on the actual operation you want to perform. Given a data set {(1, {1,2}), (2, {2,3})} what's the result of your operation? Is the result { ({1,2}, {1,2,3}) } because the 2 is contained in both sets?

Cheers,
Till

On Wed, May 25, 2016 at 10:22 AM, Simone Robutti <[hidden email]> wrote:
Hello,

I'm implementing MinHash for reccomendation on Flink. I'm almost done but I need an efficient way to merge sets of similar keys together (and later join these sets of keys with more data).

The actual data structure is of the form DataSet[(Int,Set[Int])] where the left element of the tuple is an ID for the right element, that is a set of keys. I want to merge these sets together only if they share at least one element. 

I'm rather sure to have studied the efficient solution to this problem in a local environment but I don't really know how to treat it in a distributed environment. Any suggestion?

Thanks,

Simone