Delay in Flink timers

Narendra Joshi
Hi,

We are using Flink as a timer scheduler, and delays in timer execution are
a huge problem for us. We have found that as the number of timers we
register increases, the timers start firing late (by more than 5
seconds). Can anyone point us in the right direction to figure out what
might be happening?

I have been told that `onTimer` and `processElement` are called under a
mutually exclusive lock. Could this locking be the cause? Neither
function does any I/O, so neither should take anywhere near 5 seconds.

Is it possible that calls to `processElement` starve `onTimer` calls?


--
Narendra Joshi

Re: Delay in Flink timers

Chesnay Schepler-2
It is true that onTimer and processElement are never called at the same
time.

I'm not entirely sure whether there is any prioritization/fairness
between these methods (if not, it could be that onTimer is starved).
Looping in Aljoscha, who hopefully knows more about this.

On 10.09.2017 09:31, Narendra Joshi wrote:

> [...]


Re: Delay in Flink timers

Aljoscha Krettek
Hi,

Yes, execution of these methods is protected by a synchronized block. This is not a fair lock so incoming data might starve timer callbacks. What is the number of timers we are talking about here?

Best,
Aljoscha
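
To make the effect of the shared lock concrete, here is a minimal plain-Java sketch. It is not Flink's code: `LockContentionSketch`, `measureTimerLagMillis`, and the 10 ms timer delay are made up for illustration. A "data" thread holds a monitor for a while (standing in for element processing or any other work done under the lock), and a timer callback that is already due has to wait for the same monitor before it can run. The sketch shows blocking on a held lock only; it does not try to reproduce unfair lock hand-off between many competing threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class LockContentionSketch {

    /**
     * Returns how late (in ms) a timer callback runs when another
     * thread holds the shared lock for holdMillis.
     */
    public static long measureTimerLagMillis(long holdMillis) {
        final Object lock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch timerDone = new CountDownLatch(1);
        final AtomicLong firedAtNanos = new AtomicLong();

        // Stand-in for work done under the lock (hypothetical workload).
        Thread dataThread = new Thread(() -> {
            synchronized (lock) {
                lockHeld.countDown();
                try {
                    Thread.sleep(holdMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        dataThread.start();

        ScheduledExecutorService timerService =
                Executors.newSingleThreadScheduledExecutor();
        try {
            // Only schedule the timer once the lock is definitely taken.
            lockHeld.await();

            long delayMillis = 10;
            long dueAtNanos =
                    System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);

            // The callback needs the same lock before it may run.
            timerService.schedule(() -> {
                synchronized (lock) {
                    firedAtNanos.set(System.nanoTime());
                }
                timerDone.countDown();
            }, delayMillis, TimeUnit.MILLISECONDS);

            timerDone.await();
            dataThread.join();
            return TimeUnit.NANOSECONDS.toMillis(firedAtNanos.get() - dueAtNanos);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        } finally {
            timerService.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("timer lag with a 200 ms lock hold: "
                + measureTimerLagMillis(200) + " ms");
    }
}
```

The measured lag grows with the time the lock is held, even though the callback itself does almost no work.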

> On 11. Sep 2017, at 19:38, Chesnay Schepler <[hidden email]> wrote:
> [...]


Re: Delay in Flink timers

Narendra Joshi

The number of timers is about 400 per second. We have observed that onTimer calls are delayed only when the number of scheduled timers starts increasing from a minimum. It would be great if you could share pointers to code I can look at to understand this better. :)

Narendra Joshi

On 14 Sep 2017 16:04, "Aljoscha Krettek" <[hidden email]> wrote:
> [...]


Re: Delay in Flink timers

Narendra Joshi
I have a couple of questions related to this:

1. We store state per key (RocksDB backend). Currently, the state size
is ~1.5 GB. Checkpointing time sometimes reaches ~10-20 seconds. Is it
possible that checkpointing is affecting timer execution?
2. Does checkpointing cause Flink to stop consumption of data streams
(say from Kafka)? We have observed that when the timers are delayed,
there is also a delay in picking up messages from Kafka.
3. Are there any metrics exposed by Flink that could help us
understand better where the delay is coming from? Is there a metric
for contention between `processElement` and `onTimer`?
4. Is there a plan to move from `ScheduledThreadPoolExecutor` to
timing wheels for timeouts?

If there is any other information that you need, please let me know.

On Tue, Sep 19, 2017 at 10:37 PM, Narendra Joshi <[hidden email]> wrote:

> [...]

Re: Delay in Flink timers

Stephan Ewen
Checkpoints are largely asynchronous, but the checkpointing of timers has some synchronous component (which we are currently working on getting rid of).
So when you have a lot of timers, streams stall for a short time while the timers are checkpointed. If all goes as planned, Flink 1.6 will not have that stall any more.

Concerning the delay on timers: I don't think that is an issue of heaps vs. timer wheels, etc. (timer wheels are not magically better at everything that has to do with timers).
This sounds more like the execution becoming contended. The contention could very well be caused by the checkpointing of timers (which stalls when too many timers are registered).
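
One way to check whether late timers line up with checkpoints is to record, for every firing, how far the actual firing time is behind the time the timer was registered for, and then compare the late firings against checkpoint times. The class below is a plain-Java sketch of such bookkeeping; `TimerLagTracker` and its methods are hypothetical names, not a Flink API. In a `ProcessFunction` it could be fed from `onTimer`, using the `timestamp` argument as the scheduled time and `ctx.timerService().currentProcessingTime()` as the firing time (assuming processing-time timers).

```java
public class TimerLagTracker {

    private final long thresholdMillis;
    private long firedCount;
    private long lateCount;
    private long maxLagMillis;

    /** Firings with a lag above thresholdMillis are counted as late. */
    public TimerLagTracker(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Records one timer firing.
     *
     * @param scheduledMillis time the timer was registered for
     * @param firedMillis     time the callback actually ran
     */
    public synchronized void record(long scheduledMillis, long firedMillis) {
        // A timer never fires early in this model, so clamp at zero.
        long lag = Math.max(0L, firedMillis - scheduledMillis);
        firedCount++;
        if (lag > thresholdMillis) {
            lateCount++;
        }
        maxLagMillis = Math.max(maxLagMillis, lag);
    }

    public synchronized long getFiredCount() {
        return firedCount;
    }

    public synchronized long getLateCount() {
        return lateCount;
    }

    public synchronized long getMaxLagMillis() {
        return maxLagMillis;
    }

    public static void main(String[] args) {
        // Made-up timestamps, only to show the bookkeeping.
        TimerLagTracker tracker = new TimerLagTracker(5_000);
        tracker.record(1_000, 1_200);
        tracker.record(2_000, 8_500);
        System.out.println("fired=" + tracker.getFiredCount()
                + " late=" + tracker.getLateCount()
                + " maxLagMillis=" + tracker.getMaxLagMillis());
    }
}
```

Logging the maximum lag periodically, next to checkpoint start and end times, would show whether the 5 second delays cluster around checkpoints.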


On Wed, Sep 20, 2017 at 2:53 PM, Narendra Joshi <[hidden email]> wrote:
> [...]