Hi, I want to implement my own operator that computes the Count-Min Sketch over a window in Flink. Then, I found this Jira issue [1] which is exactly what I want. I believe that I have to work out my skills to arrive at a mature solution. So, the first thing that comes to my mind is to create my custom operator like the AggregateApplyWindowFunction [2]. Through this I can create the summary of my data over a window. Also, I found this custom JoinOperator example by Till Rohrmann [3] which I think I can base my implementation since it is done over a DataStream. What are your suggestions to me in order to start to implement a custom stream operator which computes Count-Min Sketch? Do you have any custom operator over window/keyBy that I can learn with the source code? ps.: I have implemented (looking at Blink source code) this a custom Combiner [4] (map-combiner-reduce) operator. Thanks, Felipe |
Hi Felipe, In a short glance, the question can depend on how your window is (is there any overlap like sliding window) and how many data you would like to process. In general, you can always buffer all the data into a ListState and apply your window function by iterating through all those buffered elements [1]. Provided that the data size is small enough to be hold efficiently in the state-backend. If this algorithm has some sort of pre-aggregation that can simplify the buffering through an incremental, orderless aggregation, you can also think about using [2]. With these two approaches, you do not necessarily need to implement your own window operator (extending window operator can be tricky), and you also have access to the internal state [3]. Hope these helps your exploration. Thanks, Rong On Tue, Apr 23, 2019 at 8:16 AM Felipe Gutierrez <[hidden email]> wrote:
|
Hi Rong, thanks for your reply. I guess I already did something regarding what you have told to me. I have one example on this application [1], which uses this state [2] and computes a CountMinSketch [3]. I am seeking how to implement my own operator over a window in order to have more fine-grained control over it and learn with it. And hopefully, building a path to contribute to Flink in the future [4]. Best, Felipe On Wed, Apr 24, 2019 at 2:06 AM Rong Rong <[hidden email]> wrote:
|
Hi Felipe, I am not sure the algorithm requires to construct a new extension of the window operator. I think your implementation of the CountMinSketch object as an aggregator: E.g. 1. AggregateState (ACC) should be the aggregating accumulate count-min-sketch 2-D hash array (plus a few other needed fields). 2. accumulate method just simply do the update. 3. getResult simply get the frequency from sketch. Thus you will not need to use a customized ValueStateDescriptor. But I agree that maybe it is a good idea to support a class of use cases that requires approximate aggregate state (like HyperLogLog?), this might've been a good value add in my opinion. I think some further discussion is needed if we are going down that path. Do you think the original FLINK-2147 JIRA ticket is a good place to carry out that discussion? We can probably continue there or create a new JIRA for discussion. -- Rong On Wed, Apr 24, 2019 at 1:32 AM Felipe Gutierrez <[hidden email]> wrote:
|
Hi Rong, thanks for your insights. I agree with the three points that you said. My plan is to implement an operator that compute the Count-min sketch and developers can assign functions to increase the estimative of the sketch (adding more/different functions the sketch will be more precise, hence more heavy). But the operator will also hold default hash functions so the developer does not have to add any function with he does not want. Like I said, I will implement on my project. But I totally agree to keep the discussion on the original FLINK-2147 JIRA ticket. Doing so I can collect more opinions =) Thanks! Felipe On Fri, Apr 26, 2019 at 4:10 AM Rong Rong <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |