Long ids from String

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Long ids from String

Flavio Pompermaier
Hi to all,

I was reading the thread about the Neo4j connector and an old question came to my mind.

In my Flink job I have Tuples with String ids that I use to join on that I'd like to convert to long (because Flink should improve quite a lot the memory usage and the processing time if I'm not wrong). 
Is there any recommended way to do that conversion in Flink?

Best,
Flavio
Reply | Threaded
Open this post in threaded view
|

Re: Long ids from String

Martin Junghanns-2
Hi Flavio,

If you just want to assign a unique Long identifier to each element in
your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].

Best,
Martin

[1]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131

On 03.11.2015 09:42, Flavio Pompermaier wrote:

> Hi to all,
>
> I was reading the thread about the Neo4j connector and an old question came
> to my mind.
>
> In my Flink job I have Tuples with String ids that I use to join on that
> I'd like to convert to long (because Flink should improve quite a lot the
> memory usage and the processing time if I'm not wrong).
> Is there any recommended way to do that conversion in Flink?
>
> Best,
> Flavio
>
Reply | Threaded
Open this post in threaded view
|

Re: Long ids from String

Flavio Pompermaier
Hi Martin,
thanks for the suggestion but unfortunately in my use case I have another problem: I have to join triplets when f2==f0..is there any way to translate also references? i.e. if I have 2 tuples <a,b,c> <c,d,e>, when I apply that function I obtain something like <1,<a,b,c>>,<2,<c,d,e>>.
I'd like to be able to join the 1st tuple with the 2nd (so I should know that c=>2).
Which strategy do you think it could be the best option to achieve that?
At the moment I was thinking to persist the ids and use a temporary table with an autoincrement long id but maybe there's a simpler solution..

Best,
Flavio

On Tue, Nov 3, 2015 at 9:59 AM, Martin Junghanns <[hidden email]> wrote:
Hi Flavio,

If you just want to assign a unique Long identifier to each element in
your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].

Best,
Martin

[1]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131

On 03.11.2015 09:42, Flavio Pompermaier wrote:
> Hi to all,
>
> I was reading the thread about the Neo4j connector and an old question came
> to my mind.
>
> In my Flink job I have Tuples with String ids that I use to join on that
> I'd like to convert to long (because Flink should improve quite a lot the
> memory usage and the processing time if I'm not wrong).
> Is there any recommended way to do that conversion in Flink?
>
> Best,
> Flavio
>



--

Flavio Pompermaier
Development Department
_______________________________________________
OKKAMSrl www.okkam.it

Phone: +(39) 0461 283 702
Fax: + (39) 0461 186 6433
Email: [hidden email]
Headquarters: Trento (Italy), via G.B. Trener 8
Registered office: Trento (Italy), via Segantini 23

Confidentially notice. This e-mail transmission may contain legally privileged and/or confidential information. Please do not read it if you are not the intended recipient(S). Any use, distribution, reproduction or disclosure by any other person is strictly prohibited. If you have received this e-mail in error, please notify the sender and destroy the original transmission and its attachments without reading or saving it in any manner.

Reply | Threaded
Open this post in threaded view
|

Re: Long ids from String

Martin Junghanns-2
Hi,

Sounds like RDF problems to me :)

To build an index you could do the following:

triplet := <a,b,c>

(0) build set of all triplets (with strings)
triplets = triplets1.union(triplets2)

(1) assign unique long ids to each vertex
vertices = triplets.flatMap() => [<a>,<c>,...].distinct()
vertexWithID = vertices.zipWithUniqueID() => [<1,a>,<2,c>]

(2) for each triplet dataset

// update source and target identifier in triplet dataset
triplets1.join(vertexWithID)
        .where(0) // <a,b,c>
        .equalTo(1) // <1,a>
        .with(/* replace source string with unique id */)
        .join(vertexWithId)
        .where(2) // <1,b,c>
        .equalTo(1) // <2,c>
        .with(/* replace target string with unique id */)

=> <1,b,2>

(3) store updated triplet sets for later processing

Of course this is a lot of computational effort, but it needs to be done
once and you have an index of your graphs which you can use for further
processing.

If your job contains only one join between the triplet datasets, this is
clearly not an option.

Just an idea :)

Best,
Martin


On 03.11.2015 10:10, Flavio Pompermaier wrote:

> Hi Martin,
> thanks for the suggestion but unfortunately in my use case I have another
> problem: I have to join triplets when f2==f0..is there any way to translate
> also references? i.e. if I have 2 tuples <a,b,c> <c,d,e>, when I apply that
> function I obtain something like <1,<a,b,c>>,<2,<c,d,e>>.
> I'd like to be able to join the 1st tuple with the 2nd (so I should know
> that c=>2).
> Which strategy do you think it could be the best option to achieve that?
> At the moment I was thinking to persist the ids and use a temporary table
> with an autoincrement long id but maybe there's a simpler solution..
>
> Best,
> Flavio
>
> On Tue, Nov 3, 2015 at 9:59 AM, Martin Junghanns <[hidden email]>
> wrote:
>
>> Hi Flavio,
>>
>> If you just want to assign a unique Long identifier to each element in
>> your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].
>>
>> Best,
>> Martin
>>
>> [1]
>>
>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131
>>
>> On 03.11.2015 09:42, Flavio Pompermaier wrote:
>>> Hi to all,
>>>
>>> I was reading the thread about the Neo4j connector and an old question
>> came
>>> to my mind.
>>>
>>> In my Flink job I have Tuples with String ids that I use to join on that
>>> I'd like to convert to long (because Flink should improve quite a lot the
>>> memory usage and the processing time if I'm not wrong).
>>> Is there any recommended way to do that conversion in Flink?
>>>
>>> Best,
>>> Flavio
>>>
>>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Long ids from String

Fabian Hueske-2
In reply to this post by Martin Junghanns-2
Converting String ids into Long ids can be quite expensive, so you should make sure it pays off.

The save way to do it is to get all unique String ids (project, distinct), do zipWithUniqueId, and join all DataSets that have the String id with the new long id. So it is a full sort for the unique and as many joins (on the String field) as you have DataSets where you want to replace the String id.

Cheers, Fabian

2015-11-03 9:59 GMT+01:00 Martin Junghanns <[hidden email]>:
Hi Flavio,

If you just want to assign a unique Long identifier to each element in
your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].

Best,
Martin

[1]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131

On 03.11.2015 09:42, Flavio Pompermaier wrote:
> Hi to all,
>
> I was reading the thread about the Neo4j connector and an old question came
> to my mind.
>
> In my Flink job I have Tuples with String ids that I use to join on that
> I'd like to convert to long (because Flink should improve quite a lot the
> memory usage and the processing time if I'm not wrong).
> Is there any recommended way to do that conversion in Flink?
>
> Best,
> Flavio
>