Hi guys,
I am using Flink on my project and I have a question. (I am using Java.)

Is it possible to modify the keyBy method so that it keys by similarity rather than by the exact name?

Example: I receive 2 DataStreams. In the first one, the value of the field I want to key by is "John Locke", while in the second DataStream the field value is "John L". Can I use some Java library to find similarities between strings and, if the similarity is high, key those elements together?
|
Hey Iñaki,
you can use the KeySelector as described here: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-keys

But you only have a local view of the current element, i.e. the library you use to determine the similarity has to know the similarities upfront.

– Ufuk

On Mon, Jun 6, 2016 at 9:31 AM, iñaki williams <[hidden email]> wrote:
> Hi guys,
|
Thanks for your answer Ufuk.
However, I have been reading about KeySelector and I don't completely understand how it fits my idea. I am using an algorithm that gives me a score between different strings. My idea is: if the score is higher than 0.80, for example, then those two strings will be considered the same, and when I apply keyBy("name") those similar strings will be keyed as if they had the exact same name.
On Monday, June 6, 2016, Ufuk Celebi <[hidden email]> wrote:
> Hey Iñaki,
|
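To make the scoring idea above concrete, here is a minimal sketch in plain Java. It assumes a normalized Levenshtein similarity as the score; the thread does not say which algorithm is actually used, so the class `NameSimilarity` and its methods are hypothetical stand-ins and not part of Flink.

```java
// Hypothetical helper (not a Flink API): a similarity score in [0, 1]
// based on Levenshtein edit distance. The real scoring algorithm in the
// thread is unknown; this is only an assumed stand-in.
final class NameSimilarity {

    // Classic dynamic-programming edit distance: insert, delete and
    // substitute each cost 1. Only two rows of the table are kept.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical strings; 0.0 means the edit distance equals
    // the length of the longer string.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    // "Same name" in the sense described above: score higher than the threshold.
    static boolean sameName(String a, String b, double threshold) {
        return similarity(a, b) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(similarity("John Locke", "John L"));
        System.out.println(sameName("John Locke", "John Lock", 0.80));
    }
}
```

Note that a check like `sameName` always needs two strings, whereas a key has to be computed from one element on its own. With this particular measure, "John Locke" and "John L" are 4 edits apart and fall below a 0.80 threshold; a different measure may rate them differently.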
The idea behind key selectors is to extract a property on which you can do equality comparisons.

Let's get one question out of the way first: is your scoring algorithm transitive? As in, if A==B and B==C, is it a given that A==C? Because if not, there's just no way to group (= partition) the data, since B would belong to 2 distinct groups.

Even if it did work, one thing you have to realize is that this wouldn't scale at all. For every element that comes in you would have to compare it to all other groups you have created so far.

What I would propose is the following: create a key selector that allows a rough grouping of your data, something like "John L" => "J L". On that group (which is hopefully relatively small) you can then run your algorithm on all possible pairs to do whatever you want to do.

On 07.06.2016 10:48, iñaki williams wrote:
> Thanks for your answer Ufuk.
|
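The rough-grouping proposal could be sketched roughly as follows, in plain Java without any Flink dependency. The class `RoughGrouping` and its methods are hypothetical names chosen for illustration, and the rule "first letter of each word" is just one assumed way to get from "John L" to "J L". In a Flink job, the body of `roughKey` is the kind of logic that would go into `KeySelector.getKey`, since it only looks at the current element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not a Flink API): rough key first, pairwise comparison second.
final class RoughGrouping {

    // Rough key: upper-cased first letter of each whitespace-separated
    // token, e.g. "John L" -> "J L" and "John Locke" -> "J L".
    static String roughKey(String name) {
        StringBuilder key = new StringBuilder();
        for (String token : name.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(Character.toUpperCase(token.charAt(0)));
        }
        return key.toString();
    }

    // Local stand-in for keying a stream: bucket names by their rough key.
    static Map<String, List<String>> groupByRoughKey(List<String> names) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            groups.computeIfAbsent(roughKey(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    // All unordered pairs inside one group; the similarity algorithm
    // would then be run on each of these pairs.
    static List<String[]> candidatePairs(List<String> group) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                pairs.add(new String[] {group.get(i), group.get(j)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("John Locke", "John L", "Jack Shephard", "James L");
        for (Map.Entry<String, List<String>> group : groupByRoughKey(names).entrySet()) {
            for (String[] pair : candidatePairs(group.getValue())) {
                System.out.println(group.getKey() + ": " + pair[0] + " <-> " + pair[1]);
            }
        }
    }
}
```

The rough key is deliberately coarse: "James L" lands in the same group as "John Locke", and it is the pairwise scoring inside the group that has to tell them apart. The number of pairs grows quadratically with group size, which is why the groups should stay small.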