(DEPRECATED) Apache Flink User Mailing List archive.

Custom keyBy(), look for similarities

juanramallo80

Custom keyBy(), look for similarities

Hi guys,

I am using Flink in my project (with Java) and I have a question.

Is it possible to modify the keyBy method so that it keys by similarity rather than by the exact name?

Example: I receive two DataStreams. In the first one, the field that I want to key by has the value "John Locke", while in the other DataStream the field value is "John L". Can I use some Java library to measure the similarity between strings and, if the similarity is high, key those elements together?

Ufuk Celebi

Re: Custom keyBy(), look for similarities

Hey Iñaki,

you can use the KeySelector as described here:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-keys

But you only have a local view of the current element, e.g. the library
you use to determine the similarity has to know the similarities
upfront.

– Ufuk
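
To illustrate the "local view" point: a key selector is called with one element at a time, so the key has to be computable from that element plus whatever was known before the job started. The following is only a minimal sketch under stated assumptions. `KeySelector` is a local stand-in declared in the snippet with the same `getKey` shape as Flink's `org.apache.flink.api.java.functions.KeySelector`, so that the example compiles without Flink on the classpath. `Person`, `AliasKeySelector`, and the alias table are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-in mirroring the shape of Flink's
// org.apache.flink.api.java.functions.KeySelector (assumption: declared here
// only so the sketch compiles without Flink on the classpath).
interface KeySelector<IN, KEY> {
    KEY getKey(IN value) throws Exception;
}

// Hypothetical record type carrying the field to key by.
class Person {
    final String name;

    Person(String name) {
        this.name = name;
    }
}

// Hypothetical selector: maps known spelling variants to one canonical key.
// The alias table must be complete before the job starts, because getKey
// only ever sees the current element.
class AliasKeySelector implements KeySelector<Person, String> {
    private final Map<String, String> aliases;

    AliasKeySelector(Map<String, String> aliases) {
        this.aliases = new HashMap<>(aliases);
    }

    @Override
    public String getKey(Person value) {
        // Unknown names fall back to themselves, i.e. exact-match keying.
        return aliases.getOrDefault(value.name, value.name);
    }
}
```

With Flink itself, a class like this (implementing the real interface) would be passed to `keyBy`, for example `stream.keyBy(new AliasKeySelector(aliases))`. Two elements then end up in the same group only if `getKey` returns equal keys for them.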

On Mon, Jun 6, 2016 at 9:31 AM, iñaki williams <[hidden email]> wrote:

> Hi guys,
>
> I am using Flink in my project (with Java) and I have a question.
>
> Is it possible to modify the keyBy method so that it keys by similarity
> rather than by the exact name?
>
> Example: I receive two DataStreams. In the first one, the field that I
> want to key by has the value "John Locke", while in the other DataStream
> the field value is "John L". Can I use some Java library to measure the
> similarity between strings and, if the similarity is high, key those
> elements together?

juanramallo80

Re: Custom keyBy(), look for similarities

Thanks for your answer, Ufuk.

However, I have been reading about KeySelector and I don't completely understand how it would work with my idea.

I am using an algorithm that gives me a score between different strings. My idea is: if the score is higher than 0.80, for example, then those two strings will be considered the same, and when I apply keyBy("name") those similar strings will be keyed as if they had the exact same name.

On Monday, June 6, 2016, Ufuk Celebi <[hidden email]> wrote:

> Hey Iñaki,
>
> you can use the KeySelector as described here:
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-keys
>
> But you only have a local view of the current element, e.g. the library
> you use to determine the similarity has to know the similarities
> upfront.
>
> – Ufuk

Chesnay Schepler

Re: Custom keyBy(), look for similarities

The idea behind key selectors is to extract a property on which you can do equality comparisons.

Let's get one question out of the way first:
is your scoring algorithm transitive? As in, if A==B and B==C, is it a given that A==C? Because if not, there's
just no way to group (= partition) the data, since B would belong to two distinct groups.
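
A threshold on a similarity score is typically not transitive, which the sketch below tries to make concrete. The thread does not say which scoring algorithm is used, so this assumes a normalized Levenshtein similarity (1 minus edit distance divided by the longer length) together with the "higher than 0.80" rule from the question. `SimilarityDemo` and its method names are hypothetical.

```java
// Assumption: the score is a normalized Levenshtein similarity,
// 1 - distance / max(length). The thread does not name the actual algorithm.
class SimilarityDemo {

    // Classic dynamic-programming edit distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap the two rows so that prev always holds the finished row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    // "Same" means a score higher than 0.80, as in the question.
    static boolean similar(String a, String b) {
        return similarity(a, b) > 0.80;
    }

    public static void main(String[] args) {
        String a = "John L", b = "John Lo", c = "John Loc";
        System.out.println(similar(a, b)); // true:  A matches B
        System.out.println(similar(b, c)); // true:  B matches C
        System.out.println(similar(a, c)); // false: A does not match C
    }
}
```

Here "John Lo" would have to share a group with both "John L" and "John Loc", while those two must not share one. Under this particular measure, "John L" and "John Locke" score 0.6 (edit distance 4 over length 10), so they would not even be keyed together at a 0.80 threshold.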

Even if it did work, one thing you have to realize is that this wouldn't scale at all. For every element that
comes in you would have to compare it to all the groups you have created so far.
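
The cost described above can be sketched on a single machine as follows. This is an illustration only, not something Flink provides: `NaiveGrouper` is a hypothetical helper that assigns each element to the first existing group whose first member it matches and counts the comparisons made. The similarity test is passed in as a `BiPredicate`, since the thread does not specify the algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper: assigns each element to the first existing group whose
// representative it matches, and counts how many comparisons that takes.
class NaiveGrouper {
    final List<List<String>> groups = new ArrayList<>();
    long comparisons = 0;
    private final BiPredicate<String, String> similar;

    NaiveGrouper(BiPredicate<String, String> similar) {
        this.similar = similar;
    }

    void add(String element) {
        for (List<String> group : groups) {
            comparisons++;
            // Compare against the group's first member only.
            if (similar.test(group.get(0), element)) {
                group.add(element);
                return;
            }
        }
        // No match: the element starts a new group.
        List<String> fresh = new ArrayList<>();
        fresh.add(element);
        groups.add(fresh);
    }
}
```

If n elements all end up in distinct groups, the k-th element is compared against k - 1 groups, for n(n - 1)/2 comparisons in total, so the work per incoming element keeps growing with the number of groups.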

What I would propose is the following: create a key selector that allows a rough grouping of your data,
something like "John L" => "J L". On that group (which is hopefully relatively small) you can then run your
algorithm on all possible pairs to do whatever you want to do.
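
A minimal sketch of that proposal in plain Java, under stated assumptions: the rough key is taken to be the upper-cased first letter of each word (which reproduces the "John L" => "J L" example), an in-memory map stands in for `keyBy`, and `RoughGrouping`, `initialsKey`, `groupByKey`, and `pairs` are hypothetical names. The scoring algorithm itself is left out because the thread does not specify it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RoughGrouping {

    // Hypothetical rough key: the upper-cased first letter of each
    // whitespace-separated word, e.g. "John L" -> "J L".
    static String initialsKey(String name) {
        StringBuilder key = new StringBuilder();
        for (String word : name.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(Character.toUpperCase(word.charAt(0)));
        }
        return key.toString();
    }

    // Stand-in for keyBy: buckets names by their rough key.
    static Map<String, List<String>> groupByKey(List<String> names) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            groups.computeIfAbsent(initialsKey(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    // All unordered pairs inside one group; the scoring algorithm would run on each.
    static List<String[]> pairs(List<String> group) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                result.add(new String[] {group.get(i), group.get(j)});
            }
        }
        return result;
    }
}
```

In a Flink job, the `initialsKey` logic would presumably live in a KeySelector passed to `keyBy`, and the pairwise scoring would run in a downstream operator over the elements of each key. The rough key trades recall for scalability: names whose initials differ never meet, so they are never compared.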

On 07.06.2016 10:48, iñaki williams wrote:

> Thanks for your answer, Ufuk.
>
> However, I have been reading about KeySelector and I don't completely
> understand how it would work with my idea.
>
> I am using an algorithm that gives me a score between different strings.
> My idea is: if the score is higher than 0.80, for example, then those two
> strings will be considered the same, and when I apply keyBy("name") those
> similar strings will be keyed as if they had the exact same name.
