Hi Amit,
In my current approach the idea for updating rule set data was to have some kind of a "control" stream that will trigger an update to a local data structure, or a "control" event within the main data stream that will trigger the same.
Using external system like a cache or database is also an option, but that still will require some kind of a trigger to reload rule set or a single rule, in case of any updates to it.
Others have suggested using Flink managed state, but I'm still not sure whether that is a generally recommended approach in this scenario, as it seems like it was more meant for windowing-type processing instead?
Thanks,
Turar
On 6/5/18, 8:46 AM, "Amit Jain" <[hidden email]> wrote:
Hi Sandybayev,
In the current state, Flink does not provide a solution to the
mentioned use case. However, there is open FLIP[1] [2] which has been
created to address the same.
I can see in your current approach, you are not able to update the
rule set data. I think you can update rule set data by building
DataStream around changelogs which are stored in message
queue/distributed file system.
OR
You can store rule set data in the external system where you can query
for incoming keys from Flink.
[1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP- 17+Side+Inputs+for+DataStream+ API
[2]: https://issues.apache.org/jira/browse/FLINK-6131
On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta)
<[hidden email]> wrote:
> Hi,
>
>
>
> What is the best practice recommendation for the following use case? We need
> to match a stream against a set of “rules”, which are essentially a Flink
> DataSet concept. Updates to this “rules set" are possible but not frequent.
> Each stream event must be checked against all the records in “rules set”,
> and each match produces one or more events into a sink. Number of records in
> a rule set are in the 6 digit range.
>
>
>
> Currently we're simply loading rules into a local List of rules and using
> flatMap over an incoming DataStream. Inside flatMap, we're just iterating
> over a list comparing each event to each rule.
>
>
>
> To speed up the iteration, we can also split the list into several batches,
> essentially creating a list of lists, and creating a separate thread to
> iterate over each sub-list (using Futures in either Java or Scala).
>
>
>
> Questions:
>
> 1. Is there a better way to do this kind of a join?
>
> 2. If not, is it safe to add additional parallelism by creating
> new threads inside each flatMap operation, on top of what Flink is already
> doing?
>
>
>
> Thanks in advance!
>
> Turar
>
>
Free forum by Nabble | Edit this page |