Hey There,
I'm trying to calculate the betweenness in a graph with Flink and Gelly. The way I tried this was by calculating the shortest path from every node to all the other nodes. This results in a DataSet of vertices, each of which has a DataSet of its own containing all the other vertices and their paths. Next I used the Reduce function on the inner DataSets, so every inner DataSet has one value. Now I have a DataSet of DataSets with one value each, but how do I efficiently transform this into a single DataSet of values? I can do a mapping on the DataSet and use collect(), but I think that would be very costly.
Hello,
How exactly do you represent the DataSet of DataSets? I'm asking because if you have something like a DataSet<DataSet<A>>, that unfortunately doesn't work in Flink.

Best,
Gábor

2016-11-14 20:44 GMT+01:00 otherwise777 <[hidden email]>:
It seems what I tried did indeed not work.
Can you explain why that doesn't work, though?
The short answer is that DataSet is not serializable.
I think the main underlying problem is that Flink needs to see all DataSet operations before launching the job. However, if you have a DataSet<DataSet<A>>, then operations on the inner DataSets end up being specified inside the UDFs of operations on the outer DataSet. This is a problem because Flink cannot see inside the UDFs before the job starts, since they are executed only after the job starts running.

There are some workarounds, though:

1. If you know that your inner DataSets would be small, you can replace them with a regular Java/Scala collection class, such as an Array or a List.

2. You can often flatten your data, that is, represent your nested collection with a flat collection. Exactly how to do this depends on your use case. For example, suppose that originally we wanted to represent the lengths of the shortest paths between all pairs of vertices in a graph by a DataSet that, for every vertex, contains a DataSet of the distances to all the other vertices:

DataSet<Tuple2<Vertex, DataSet<Tuple2<Vertex, Int>>>>

This doesn't work because of the nested DataSets, but you could flatten it into

DataSet<Tuple3<Vertex, Vertex, Int>>

which is a DataSet that contains pairs of vertices and their distances.

Btw. [1] is a paper where some graph data structures with complex nesting are represented in Flink.

Best,
Gábor

[1] http://dbs.uni-leipzig.de/file/EPGM.pdf

2016-11-15 17:37 GMT+01:00 otherwise777 <[hidden email]>: