Retrieving values from a dataset of datasets

otherwise777
Hey There,

I'm trying to calculate betweenness in a graph with Flink and Gelly. My approach was to calculate the shortest paths from every node to all the other nodes. This results in a DataSet of vertices, each of which has a DataSet of its own containing all the other vertices and the paths to them.

Next I used the Reduce function on the inner DataSets, so that every inner DataSet has 1 value.

Now I have a DataSet of DataSets with 1 value each, but how do I efficiently transform this into a single DataSet of values? I can do a mapping on the DataSet and use collect(), but I think that would be very costly.

Re: Retrieving values from a dataset of datasets

Gábor Gévay
Hello,

How exactly do you represent the DataSet of DataSets? I'm asking
because if you have something like a
DataSet<DataSet<A>>
that unfortunately doesn't work in Flink.

Best,
Gábor

(In reply to otherwise777's message of 2016-11-14 20:44 GMT+01:00.)
Re: Retrieving values from a dataset of datasets

otherwise777
It seems that what I tried indeed did not work.
Can you explain why it doesn't work, though?
Re: Retrieving values from a dataset of datasets

Gábor Gévay
The short answer is that DataSet is not serializable.

I think the main underlying problem is that Flink needs to see all
DataSet operations before launching the job. However, if you have a
DataSet<DataSet<A>>, then operations on the inner DataSets will end up
being specified inside the UDFs of operations on the outer DataSet.
This is a problem, because Flink cannot see inside the UDFs before the
job starts, since they get executed only after the job starts
executing.

There are some workarounds though:

1. If you know that your inner DataSets would be small, then you can
instead replace them with some regular Java/Scala collection class,
like an Array or List.

2. You can often flatten your data, that is, somehow represent your
nested collection with a flat collection. Exactly how to do this
depends on your use case. For example, suppose that originally we
wanted to represent the lengths of the shortest paths between all
pairs of vertices in a graph by a DataSet that for every vertex
contains a DataSet that tells us the distances to all the other
vertices:
DataSet<Tuple2<Vertex, DataSet<Tuple2<Vertex, Int>>>>
This doesn't work because of the nested DataSets, but you could
flatten this into the following:
DataSet<Tuple3<Vertex, Vertex, Int>>
which is a DataSet that contains pairs of vertices and their distances.
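
To make the flattening idea concrete, here is a minimal sketch in plain Java. It does not use Flink at all: ordinary java.util collections stand in for DataSets, and the class name, the Triple type, the vertex labels and the distances are all made up for illustration. It shows the nested form from workaround 1 (a small in-memory collection per vertex) being turned into the flat (source, target, distance) form from workaround 2, followed by a per-source reduction done on the flat data. In the Flink DataSet API the analogous step would presumably be a groupBy on the first tuple field followed by reduce or sum, rather than a reduce on an inner DataSet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenDistances {

    /** One flat record: (source vertex, target vertex, distance). Hypothetical type. */
    static final class Triple {
        final String source;
        final String target;
        final int distance;

        Triple(String source, String target, int distance) {
            this.source = source;
            this.target = target;
            this.distance = distance;
        }
    }

    /** Nested form: each vertex owns a small collection of (target, distance) entries. */
    static Map<String, Map<String, Integer>> sampleNested() {
        Map<String, Integer> fromA = new LinkedHashMap<>();
        fromA.put("B", 1);
        fromA.put("C", 2);

        Map<String, Integer> fromB = new LinkedHashMap<>();
        fromB.put("C", 1);

        Map<String, Map<String, Integer>> nested = new LinkedHashMap<>();
        nested.put("A", fromA);
        nested.put("B", fromB);
        return nested;
    }

    /** Flat form: one record per (source, target) pair, with no nesting left. */
    static List<Triple> flatten(Map<String, Map<String, Integer>> nested) {
        List<Triple> flat = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> outer : nested.entrySet()) {
            for (Map.Entry<String, Integer> inner : outer.getValue().entrySet()) {
                flat.add(new Triple(outer.getKey(), inner.getKey(), inner.getValue()));
            }
        }
        return flat;
    }

    /** "Reduce each inner collection" expressed on the flat form: group by source, then sum. */
    static Map<String, Integer> totalDistancePerSource(List<Triple> flat) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Triple t : flat) {
            totals.merge(t.source, t.distance, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Triple> flat = flatten(sampleNested());
        for (Triple t : flat) {
            System.out.println(t.source + " -> " + t.target + " : " + t.distance);
        }
        System.out.println(totalDistancePerSource(flat));
    }
}
```

The point of the sketch is only the shape of the data: once every record carries its source vertex, "one value per inner DataSet" becomes "one value per group", and no collect() on the outer collection is needed.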

Btw. [1] is a paper where some graph data structures having complex
nesting are represented in Flink.

Best,
Gábor

[1] http://dbs.uni-leipzig.de/file/EPGM.pdf

(In reply to otherwise777's message of 2016-11-15 17:37 GMT+01:00.)