[Proposal] CEP library changes - review request


Shailesh Jain
Hi,

We've been running into issues* because watermarks are not supported per key, which leaves us with two options:

Either (a) run the job in processing time on a KeyedStream -> this compromises use cases that revolve around catching time-based patterns
or (b) run the job in event time with multiple data streams (one data stream per key) -> this is not scalable, as the number of operators grows linearly with the number of keys

To address this, we've made a quick proof-of-concept (PoC) change in the AbstractKeyedCEPPatternOperator that allows the NFAs to progress based on timestamps extracted from the events arriving at the operator (and not on the watermarks). We've tested it against our use case and are seeing a significant improvement in memory usage without compromising on the watermark functionality.
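
To make the idea concrete, here is a simplified, Flink-free sketch of a per-key event-time clock. It is not the code in the branch: the class PerKeyEventClock, its methods, and the "every event opens a partial match" placeholder are all hypothetical and only illustrate how time for one key can advance from that key's own event timestamps, independently of other keys and of any watermark.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: each key keeps its own event-time clock, advanced by
 * the timestamps of that key's events rather than by a global watermark.
 */
class PerKeyEventClock {
    // Highest event timestamp seen so far for each key.
    private final Map<String, Long> currentTimePerKey = new HashMap<>();
    // Start timestamps of partial matches still open for each key
    // (a stand-in for the per-key NFA state).
    private final Map<String, List<Long>> openMatchesPerKey = new HashMap<>();
    // Length of the pattern's time window, in milliseconds.
    private final long withinMillis;

    PerKeyEventClock(long withinMillis) {
        this.withinMillis = withinMillis;
    }

    /**
     * Processes one event and returns the start timestamps of this key's
     * partial matches that timed out as a result.
     */
    List<Long> onEvent(String key, long timestamp) {
        // Advance only this key's clock; out-of-order events never move it back.
        long now = Math.max(timestamp,
                currentTimePerKey.getOrDefault(key, Long.MIN_VALUE));
        currentTimePerKey.put(key, now);

        List<Long> open = openMatchesPerKey.computeIfAbsent(key, k -> new ArrayList<>());
        List<Long> timedOut = new ArrayList<>();
        // Expire partial matches whose window has elapsed on this key's clock.
        open.removeIf(start -> {
            if (now - start > withinMillis) {
                timedOut.add(start);
                return true;
            }
            return false;
        });

        // Placeholder for real NFA logic: every event opens a new partial match.
        open.add(timestamp);
        return timedOut;
    }

    /** Returns the current event time for a key, or Long.MIN_VALUE if unseen. */
    long currentTime(String key) {
        return currentTimePerKey.getOrDefault(key, Long.MIN_VALUE);
    }
}
```

In this sketch, an event for key "b" at t=5000 does not expire anything for key "a"; only a later event for "a" itself moves that key's clock forward and triggers its timeouts. A key that stops sending events never advances, which is one of the trade-offs of this kind of approach.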

It would be really helpful if someone from the CEP dev group could take a look at this branch - https://github.com/jainshailesh/flink/commits/cep_changes - and comment on the approach taken, and perhaps guide us on the next steps for taking it forward.

Thanks,
Shailesh

* Links to previous email threads related to the same issue:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Question-on-event-time-functionality-using-Flink-in-a-IoT-usecase-td18653.html
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Generate-watermarks-per-key-in-a-KeyedStream-td16629.html
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Correlation-between-number-of-operators-and-Job-manager-memory-requirements-td18384.html