Hi to all,
I was reading the thread about the Neo4j connector and an old question came to my mind.

In my Flink job I have Tuples with String ids that I join on, and I'd like to convert them to long (because, if I'm not wrong, that should improve Flink's memory usage and processing time quite a lot). Is there any recommended way to do that conversion in Flink?

Best,
Flavio
Hi Flavio,
If you just want to assign a unique Long identifier to each element in your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].

Best,
Martin

[1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131
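For illustration, a minimal sketch of how the method can be used (class name and toy data are made up; note that the assigned ids are unique but not necessarily consecutive):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithUniqueIdExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // toy input; in a real job this would be the DataSet of String ids
        DataSet<String> ids = env.fromElements("a", "b", "c");

        // assigns a unique Long id to each element (unique, not necessarily consecutive)
        DataSet<Tuple2<Long, String>> idsWithLong = DataSetUtils.zipWithUniqueId(ids);

        idsWithLong.print();
    }
}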
Hi Martin,

Thanks for the suggestion, but in my use case I have another problem: I have to join triplets where f2 == f0. Is there any way to translate the references as well? I.e., if I have the two tuples <a,b,c> and <c,d,e>, applying that function gives me something like <1,<a,b,c>> and <2,<c,d,e>>. I'd like to be able to join the first tuple with the second, so I need to know that c => 2.

Which strategy do you think would be the best option to achieve that? At the moment I'm thinking of persisting the ids in a temporary table with an auto-increment long id, but maybe there's a simpler solution.

Best,
Flavio
Hi,
Sounds like RDF problems to me :) To build an index you could do the following (triplet := <a,b,c>):

(0) Build the set of all triplets (with strings):
    triplets = triplets1.union(triplets2)

(1) Assign a unique long id to each vertex:
    vertices = triplets.flatMap() => [<a>,<c>,...].distinct()
    vertexWithId = vertices.zipWithUniqueId() => [<1,a>,<2,c>]

(2) For each triplet dataset, update the source and target identifiers:
    triplets1.join(vertexWithId)
             .where(0)   // <a,b,c>
             .equalTo(1) // <1,a>
             .with(/* replace source string with unique id */)
             .join(vertexWithId)
             .where(2)   // <1,b,c>
             .equalTo(1) // <2,c>
             .with(/* replace target string with unique id */)
    => <1,b,2>

(3) Store the updated triplet sets for later processing.

Of course this is a lot of computational effort, but it only needs to be done once and you get an index of your graphs that you can use for further processing. If your job contains only one join between the triplet datasets, this is clearly not an option.

Just an idea :)

Best,
Martin
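A rough, compilable version of that sketch in the Java DataSet API could look like this (class name and the toy input are invented for illustration; the JoinFunctions simply rebuild the tuples with the Long ids):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.DataSetUtils;
import org.apache.flink.util.Collector;

public class TripletIndexExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (0) toy triplet dataset <source, edge, target>; in practice: triplets1.union(triplets2)
        DataSet<Tuple3<String, String, String>> triplets = env.fromElements(
                new Tuple3<String, String, String>("a", "b", "c"),
                new Tuple3<String, String, String>("c", "d", "e"));

        // (1) all distinct vertex strings (sources and targets)
        DataSet<String> vertices = triplets
                .flatMap(new FlatMapFunction<Tuple3<String, String, String>, String>() {
                    @Override
                    public void flatMap(Tuple3<String, String, String> t, Collector<String> out) {
                        out.collect(t.f0); // source
                        out.collect(t.f2); // target
                    }
                })
                .distinct();

        // unique Long id per vertex string: <id, vertexString>
        DataSet<Tuple2<Long, String>> vertexWithId = DataSetUtils.zipWithUniqueId(vertices);

        // (2a) replace the source string with its id: <sourceId, edge, targetString>
        DataSet<Tuple3<Long, String, String>> sourceReplaced = triplets
                .join(vertexWithId)
                .where(0).equalTo(1)
                .with(new JoinFunction<Tuple3<String, String, String>, Tuple2<Long, String>, Tuple3<Long, String, String>>() {
                    @Override
                    public Tuple3<Long, String, String> join(Tuple3<String, String, String> t, Tuple2<Long, String> v) {
                        return new Tuple3<>(v.f0, t.f1, t.f2);
                    }
                });

        // (2b) replace the target string with its id: <sourceId, edge, targetId>
        DataSet<Tuple3<Long, String, Long>> indexed = sourceReplaced
                .join(vertexWithId)
                .where(2).equalTo(1)
                .with(new JoinFunction<Tuple3<Long, String, String>, Tuple2<Long, String>, Tuple3<Long, String, Long>>() {
                    @Override
                    public Tuple3<Long, String, Long> join(Tuple3<Long, String, String> t, Tuple2<Long, String> v) {
                        return new Tuple3<>(t.f0, t.f1, v.f0);
                    }
                });

        // (3) store/use the indexed triplets; vertexWithId is the String <-> Long mapping
        indexed.print();
    }
}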
Converting String ids into Long ids can be quite expensive, so you should make sure it pays off. The safe way to do it is to get all unique String ids (project, distinct), do zipWithUniqueId, and then join every DataSet that carries the String id with the new long id. So it costs a full sort for the distinct and as many joins (on the String field) as there are DataSets where you want to replace the String id.
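The single-DataSet variant of that recipe might look roughly like this (names and toy data are invented; here the join output is built with a projection instead of a JoinFunction, and the triplet case earlier in the thread just repeats the join once per String field):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ReplaceStringIdExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // toy DataSet with a String id in field 0 and some payload in field 1
        DataSet<Tuple2<String, Integer>> data = env.fromElements(
                new Tuple2<String, Integer>("a", 1),
                new Tuple2<String, Integer>("b", 2),
                new Tuple2<String, Integer>("a", 3));

        // project + distinct: one full sort/shuffle to get the unique String ids
        DataSet<Tuple1<String>> stringIds = data.project(0);
        DataSet<Tuple1<String>> uniqueIds = stringIds.distinct();

        // <longId, <stringId>> mapping
        DataSet<Tuple2<Long, Tuple1<String>>> mapping = DataSetUtils.zipWithUniqueId(uniqueIds);

        // one join like this per DataSet that still carries the String id
        DataSet<Tuple2<Long, Integer>> withLongIds = data
                .join(mapping)
                .where("f0").equalTo("f1.f0") // String id == mapped String id
                .projectSecond(0)             // the new Long id
                .projectFirst(1);             // the original payload

        withLongIds.print();
    }
}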