latency critical job


latency critical job

makeyang
Some jobs are latency-critical, which means they cannot accept latency beyond a
certain threshold.
So will Flink provide a timeout operator in the near future? That is, when one
operator times out, the JobManager would schedule a new instance of that operator
that starts from the operator's previous state, keeps processing new events, and
discards the events that were being processed.





Re: latency critical job

Timo Walther
Hi,

Usually Flink should have constant latency if the job is implemented
correctly. But if you want to implement something like an external
monitoring process, you can use the REST API [1] and the metrics [2] to
model such behavior by restarting your application. In theory, you
could also implement a mechanism in your job that throws a runtime
exception to cancel your job when a certain threshold is violated; this
would then trigger a restart if you have set a restart strategy.

Regards,
Timo

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/rest_api.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html
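
For illustration, a minimal sketch of the exception-based approach; the
LatencyGuard function, the Event type, and the threshold are illustrative
names, not part of Flink's API:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class LatencyGuardJob {

    // Illustrative event type carrying its creation timestamp.
    public static class Event {
        public long eventTimeMillis;
        public String payload;
    }

    // Fails the job when a single element's observed delay exceeds the
    // threshold; with a restart strategy configured, Flink then restarts
    // the job from its last successful checkpoint.
    public static class LatencyGuard extends ProcessFunction<Event, Event> {
        private final long maxLatencyMillis;

        public LatencyGuard(long maxLatencyMillis) {
            this.maxLatencyMillis = maxLatencyMillis;
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Event> out) {
            long observed = System.currentTimeMillis() - event.eventTimeMillis;
            if (observed > maxLatencyMillis) {
                throw new RuntimeException(
                        "Latency threshold violated: " + observed + " ms");
            }
            out.collect(event);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Without a restart strategy, the exception fails the job
        // permanently instead of restarting it.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        Event event = new Event();
        event.eventTimeMillis = System.currentTimeMillis();
        event.payload = "example";

        // In a real job, replace fromElements with the actual source.
        env.fromElements(event)
                .process(new LatencyGuard(5_000))
                .print();

        env.execute("latency-guarded job");
    }
}

Note that restarting from the last checkpoint means in-flight events are
reprocessed rather than discarded, so this is not exactly the "discard"
semantics asked about.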



Re: latency critical job

Rong Rong
Hi Makeyang,

+1 on Timo's point. We have dealt with this kind of problem before, and in general Flink can keep latency under control if the job is implemented correctly and assigned the right amount of compute resources (depending on what kind of resource isolation/containerization you are doing) to absorb additional traffic spikes. In addition, we run our own external monitoring process on top of [2] to provide extra auto-scaling capability.

Can you share more information regarding which operator is actually causing the latency to pile up (you should be able to fetch that using [1], as shown in the sketch below) and which Flink version you are using?
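
For reference, per-operator latency shows up in [2] only when latency tracking is on; depending on your configuration it may be off or at a coarse default. A minimal sketch of setting the latency-marker interval explicitly (the class name and interval are illustrative):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableLatencyTracking {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Emit a latency marker from every source once per second so that
        // per-operator latency is exposed via the metrics system and can
        // be read through the monitoring REST API.
        env.getConfig().setLatencyTrackingInterval(1000);
        // ... build and execute the pipeline as usual ...
    }
}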

Thanks,
Rong







Re: latency critical job

makeyang
Timo:
Thanks for your suggestion.




Re: latency critical job

makeyang
Rong Rong:
    My Flink version is 1.4.2.
    Since we are using a Docker environment with shared disk I/O, we have
observed that a disk I/O spike caused by another process on the same physical
machine can lead to long operator processing times.



