Hi All, I have the following questions. 1) can we do Flink CEP on event stream or batch? 2) If we can do streaming I wonder how long can we keep the stream stateful? I also wonder if anyone successfully had done any stateful streaming for days or months(with or without CEP)? or is stateful streaming is mainly to keep state only for a few hours? I have a use case where events are ingested from multiple sources and in theory, the sources are supposed to have the same events however in practice the sources will not have the same events so when the events are ingested from multiple sources the goal is to detect where the "breaks" are(meaning the missing events like exists in one source but not in other)? so I realize this is the typical case for CEP. Also, in this particular use case events that supposed to come 2 years ago can come today and if so, need to update those events also in real time or near real time. Sure there wouldn't be a lot of events that were missed 2 years ago but there will be a few. What would be the best approach? One solution I can think of is to do Stateful CEP with a window of one day or whatever short time period where most events will occur and collect the events that fall beyond that time period(The late ones) into some Kafka topic and have a separate stream analyze the time period of the late ones, construct the corresponding NFA and run through it again. Please let me know how this sounds or if there is a better way to do it. Thanks! |
Hi, Stateful streaming applications are typically designed to run continuously (i.e., until forever or until they are not needed anymore or replaced). May jobs run for weeks or months. IMO, using CEP for "simple" equality matches would add too much
complexity for a use case that can be easily solved with a stateful
function. If your task is to ensure that two streams have the same events, I'd recommend to implement a custom DataStream application with a stateful ProcessFunction. Holding state for two years is certainly possible if you know exactly which events to keep, i.e., you do not store the full stream but only those few events that have not had a match yet. If you need to run the same logic also on batch data, you might want to check if you can use SQL or the Table API which are designed to work on static and streaming data with the same processing semantics. Best, Fabian Am Di., 30. Apr. 2019 um 06:49 Uhr schrieb kant kodali <[hidden email]>:
|
Hi Fabian, Actually, now that I had gone through my use case I can say that the equality matches are more like expressions. for example the sum(col1, col2) of datasetA = col3 datasetB. And these expressions can include, sum, if & else, trim, substring, absolute_value etc.. and they are submitted by the user in Adhoc fashion. My job is to apply these expressions on two different streams and identify the breaks and report. Any suggestions would be appreciated. Thanks! On Tue, Apr 30, 2019 at 2:20 AM Fabian Hueske <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |