Hello All,

I have a streaming job running in production that processes over 2 billion events per day and does some heavy processing on each event. We have been facing challenges managing Flink in production, such as scaling in and out and restarting the job with a savepoint. Flink provides a lot of features that made it seem like an obvious choice at the time, but with all the operational overhead we are now asking whether we should keep using Flink for our stream processing requirements or switch to Kafka Streams.

We currently deploy Flink on ECR. Bringing up a new cluster for each new streaming job is too expensive, but on the flip side, running jobs on a shared cluster is difficult since there is no way to say that one job must run on a dedicated server while another can run on a shared instance. Also, taking a savepoint, cancelling, and submitting a new job results in some downtime. The most critical limitation is that there is no state shared among all tasks, i.e. a global state. We approximate this today with an external Redis cache, but that incurs cost as well.

If we move to Kafka Streams, deployment becomes much easier: each new stream job is a microservice that can scale independently, and with global state it is much easier to share state without an external cache. The disadvantage is that we have to rely on topic partitions for parallelism. Although this initially sounds easier, it may become a bottleneck when we need to scale much higher.

Do you have any suggestions? We need to decide which way to move forward, and any advice would be a great help. Thanks
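To illustrate the global-state point above, here is a minimal Kafka Streams sketch using a GlobalKTable, which is fully replicated to every application instance so all tasks can read the same state without an external cache. The topic names, application id, and bootstrap server are hypothetical placeholders, and this is only a sketch of the pattern, not the poster's actual job:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class GlobalStateSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical application id and broker address.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // A GlobalKTable is replicated in full to every instance, so every
        // task sees the same state without an external cache such as Redis.
        GlobalKTable<String, String> reference =
                builder.globalTable("reference-data"); // hypothetical topic

        KStream<String, String> events = builder.stream("events"); // hypothetical topic

        // Join each event against the shared global state. No repartitioning
        // is needed because the global table is present on every instance.
        events.join(reference,
                    (eventKey, eventValue) -> eventKey,              // event key -> table key
                    (eventValue, refValue) -> eventValue + "|" + refValue)
              .to("enriched-events");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Note the trade-off mentioned above still applies: parallelism of the `events` stream is capped by its partition count, while the global table is replicated rather than partitioned.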
Hi,

From your description, it seems the big problem is scaling in and out, and that there may be significant downtime from triggering a savepoint and restoring from it. We previously proposed a feature named stop-with-checkpoint [1], which works like stop-with-savepoint but triggers a checkpoint instead of a savepoint; if you use incremental checkpoints, this can improve the speed considerably. Since this feature has not been merged yet, you can instead try restoring from a retained checkpoint of the previous job [2].

For scale-in and scale-out, if the restore costs too much time, measure where that time goes; if too much of it is spent downloading state, you can try the multi-threaded download feature [3].

On Fri, Nov 8, 2019 at 3:38 PM, Navneeth Krishnan <[hidden email]> wrote:
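The retained-checkpoint restore suggested above can be sketched with the standard Flink CLI. This is a hedged sketch only: the job id, checkpoint path, parallelism, and jar name are placeholders, and it assumes externalized checkpoints were enabled in the job (e.g. via `CheckpointConfig#enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)`) with `state.checkpoints.dir` configured:

```shell
# 1. Cancel the running job; with RETAIN_ON_CANCELLATION configured,
#    the last completed checkpoint survives cancellation.
flink cancel 9f192a9eafab12c3a0bb1c0e01ecf6e4

# 2. Locate the latest retained checkpoint under state.checkpoints.dir,
#    e.g. s3://my-bucket/flink-checkpoints/<job-id>/chk-42

# 3. Resubmit the job, restoring from the retained checkpoint with -s,
#    exactly as you would restore from a savepoint, optionally with a
#    new parallelism (-p) to scale in or out.
flink run -s s3://my-bucket/flink-checkpoints/9f192a9eafab12c3a0bb1c0e01ecf6e4/chk-42 \
  -p 16 my-streaming-job.jar
```

This avoids the cost of taking a full savepoint on shutdown, at the price of restoring from a checkpoint that may be slightly older than the moment of cancellation.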
Thanks Congxian. Yes, it has been very hard to manage the cluster, and that's why we are trying to evaluate alternate choices. If anyone has found better methods to deploy and scale, it would be great to know so that we can adopt them as well. Thanks

On Fri, Nov 8, 2019 at 1:56 AM Congxian Qiu <[hidden email]> wrote: