Hi to all,
I was reading the thread about the Neo4j connector and an old question came to my mind.

In my Flink job I have Tuples with String ids that I join on, and I'd like to convert them to long (because, if I'm not wrong, that should improve Flink's memory usage and processing time quite a lot). Is there any recommended way to do that conversion in Flink?

Best,
Flavio
Hi Flavio,
If you just want to assign a unique Long identifier to each element in your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].

Best,
Martin

[1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131
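For illustration, a minimal sketch of how the method can be used (class name and toy data are made up; note that the assigned ids are unique but not necessarily consecutive):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithUniqueIdExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // toy input; in a real job this would be the DataSet of String ids
        DataSet<String> ids = env.fromElements("a", "b", "c");

        // assigns a unique Long id to each element (unique, not necessarily consecutive)
        DataSet<Tuple2<Long, String>> idsWithLong = DataSetUtils.zipWithUniqueId(ids);

        idsWithLong.print();
    }
}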
Hi Martin,

Thanks for the suggestion, but in my use case I have another problem: I have to join triplets where f2 == f0. Is there any way to translate the references as well? I.e., if I have the two tuples <a,b,c> and <c,d,e>, applying that function gives me something like <1,<a,b,c>> and <2,<c,d,e>>. I'd like to be able to join the first tuple with the second, so I need to know that c => 2.

Which strategy do you think would be the best option to achieve that? At the moment I'm thinking of persisting the ids in a temporary table with an auto-increment long id, but maybe there's a simpler solution.

Best,
Flavio
Hi,
Sounds like RDF problems to me :) To build an index you could do the following (triplet := <a,b,c>):

(0) Build the set of all triplets (with strings):
    triplets = triplets1.union(triplets2)

(1) Assign a unique long id to each vertex:
    vertices = triplets.flatMap() => [<a>,<c>,...].distinct()
    vertexWithId = vertices.zipWithUniqueId() => [<1,a>,<2,c>]

(2) For each triplet dataset, update the source and target identifiers:
    triplets1.join(vertexWithId)
             .where(0)   // <a,b,c>
             .equalTo(1) // <1,a>
             .with(/* replace source string with unique id */)
             .join(vertexWithId)
             .where(2)   // <1,b,c>
             .equalTo(1) // <2,c>
             .with(/* replace target string with unique id */)
    => <1,b,2>

(3) Store the updated triplet sets for later processing.

Of course this is a lot of computational effort, but it only needs to be done once and you get an index of your graphs that you can use for further processing. If your job contains only one join between the triplet datasets, this is clearly not an option.

Just an idea :)

Best,
Martin
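A rough, compilable version of that sketch in the Java DataSet API could look like this (class name and the toy input are invented for illustration; the JoinFunctions simply rebuild the tuples with the Long ids):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.DataSetUtils;
import org.apache.flink.util.Collector;

public class TripletIndexExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (0) toy triplet dataset <source, edge, target>; in practice: triplets1.union(triplets2)
        DataSet<Tuple3<String, String, String>> triplets = env.fromElements(
                new Tuple3<String, String, String>("a", "b", "c"),
                new Tuple3<String, String, String>("c", "d", "e"));

        // (1) all distinct vertex strings (sources and targets)
        DataSet<String> vertices = triplets
                .flatMap(new FlatMapFunction<Tuple3<String, String, String>, String>() {
                    @Override
                    public void flatMap(Tuple3<String, String, String> t, Collector<String> out) {
                        out.collect(t.f0); // source
                        out.collect(t.f2); // target
                    }
                })
                .distinct();

        // unique Long id per vertex string: <id, vertexString>
        DataSet<Tuple2<Long, String>> vertexWithId = DataSetUtils.zipWithUniqueId(vertices);

        // (2a) replace the source string with its id: <sourceId, edge, targetString>
        DataSet<Tuple3<Long, String, String>> sourceReplaced = triplets
                .join(vertexWithId)
                .where(0).equalTo(1)
                .with(new JoinFunction<Tuple3<String, String, String>, Tuple2<Long, String>, Tuple3<Long, String, String>>() {
                    @Override
                    public Tuple3<Long, String, String> join(Tuple3<String, String, String> t, Tuple2<Long, String> v) {
                        return new Tuple3<>(v.f0, t.f1, t.f2);
                    }
                });

        // (2b) replace the target string with its id: <sourceId, edge, targetId>
        DataSet<Tuple3<Long, String, Long>> indexed = sourceReplaced
                .join(vertexWithId)
                .where(2).equalTo(1)
                .with(new JoinFunction<Tuple3<Long, String, String>, Tuple2<Long, String>, Tuple3<Long, String, Long>>() {
                    @Override
                    public Tuple3<Long, String, Long> join(Tuple3<Long, String, String> t, Tuple2<Long, String> v) {
                        return new Tuple3<>(t.f0, t.f1, v.f0);
                    }
                });

        // (3) store/use the indexed triplets; vertexWithId is the String <-> Long mapping
        indexed.print();
    }
}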
Converting String ids into Long ids can be quite expensive, so you should make sure it pays off. The safe way to do it is to get all unique String ids (project, distinct), do zipWithUniqueId, and then join every DataSet that carries the String id with the new long id. So it costs a full sort for the distinct and as many joins (on the String field) as there are DataSets where you want to replace the String id.
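The single-DataSet variant of that recipe might look roughly like this (names and toy data are invented; here the join output is built with a projection instead of a JoinFunction, and the triplet case earlier in the thread just repeats the join once per String field):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ReplaceStringIdExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // toy DataSet with a String id in field 0 and some payload in field 1
        DataSet<Tuple2<String, Integer>> data = env.fromElements(
                new Tuple2<String, Integer>("a", 1),
                new Tuple2<String, Integer>("b", 2),
                new Tuple2<String, Integer>("a", 3));

        // project + distinct: one full sort/shuffle to get the unique String ids
        DataSet<Tuple1<String>> stringIds = data.project(0);
        DataSet<Tuple1<String>> uniqueIds = stringIds.distinct();

        // <longId, <stringId>> mapping
        DataSet<Tuple2<Long, Tuple1<String>>> mapping = DataSetUtils.zipWithUniqueId(uniqueIds);

        // one join like this per DataSet that still carries the String id
        DataSet<Tuple2<Long, Integer>> withLongIds = data
                .join(mapping)
                .where("f0").equalTo("f1.f0") // String id == mapped String id
                .projectSecond(0)             // the new Long id
                .projectFirst(1);             // the original payload

        withLongIds.print();
    }
}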