|
Hi,
We're working on problems in the IoT domain and using Flink to address certain use cases (predominantly CEP). There are multiple devices of the same type (e.g. temperature sensors) which continuously push events. These N devices are distinct and independent data sources, mostly residing at different geographical locations. The clocks of all the devices are based on network time (synced with NTP servers).
One of our current pain points is that we have to create a separate data stream per device, as opposed to a single keyed stream (keyed on the device id). External factors such as network loss and device reboots cause certain devices (and eventually their respective data streams) to lag behind, and unfortunately it is not possible to predict an upper bound on this lag.
We want to leverage the event time functionality provided by the framework, and since per-key watermarks are not supported, the number of data streams (and hence the number of duplicate CEP/window/etc. operators) scales linearly with the number of devices deployed. This is proving to be a major bottleneck for us.
Questions:
1. What is the reason for not supporting per-key watermarks? That is, each operator could maintain N current-time variables/timers (one per key) as opposed to a single 'clock' variable. One reason, I guess, could be related to "What happens if downstream data streams and operators are not keyed?". Is this the only limitation?
2. Is there some fundamental aspect of the framework which we have missed? It would be really helpful if the community could point us to any existing case studies similar to the use case described above, so that we can make sure we are on the right track.
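To make question 1 more concrete, here is a rough sketch in plain Java of the "N clock variables" idea. It is only an illustration and does not use any Flink API: the class `PerKeyClock`, its methods, and the fixed out-of-orderness bound are all hypothetical names and assumptions. An event is treated as late when its timestamp is at or below the relevant watermark.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT a Flink class or API.
// It tracks one event-time "clock" per device id, and also exposes the
// single shared clock that all keys would otherwise be measured against.
class PerKeyClock {
    // Assumed fixed bound on out-of-orderness within one device's stream.
    private final long maxOutOfOrdernessMs;
    // Highest event timestamp seen so far, per device id.
    private final Map<String, Long> maxTimestampPerKey = new HashMap<>();

    PerKeyClock(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Records an event and returns the watermark of that device afterwards.
    long onEvent(String deviceId, long eventTimeMs) {
        long max = maxTimestampPerKey.merge(deviceId, eventTimeMs, Math::max);
        return max - maxOutOfOrdernessMs;
    }

    // Per-key clock: depends only on events of this device.
    long watermarkFor(String deviceId) {
        Long max = maxTimestampPerKey.get(deviceId);
        return max == null ? Long.MIN_VALUE : max - maxOutOfOrdernessMs;
    }

    // Single shared clock: driven by the most advanced device.
    long globalWatermark() {
        long max = Long.MIN_VALUE;
        for (long ts : maxTimestampPerKey.values()) {
            max = Math.max(max, ts);
        }
        return max == Long.MIN_VALUE ? Long.MIN_VALUE : max - maxOutOfOrdernessMs;
    }

    // An event is late if its timestamp is at or below the watermark.
    boolean isLatePerKey(String deviceId, long eventTimeMs) {
        return eventTimeMs <= watermarkFor(deviceId);
    }

    boolean isLateGlobal(long eventTimeMs) {
        return eventTimeMs <= globalWatermark();
    }
}
```

With a bound of 1000 ms, if device "A" has reached timestamp 100000 while a lagging device "B" is only at 40000, the next event from "B" at 41000 is late against the shared clock but on time against its own per-key clock. This is the behaviour we currently emulate by running one stream per device.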
Thanks, Shailesh
|