Flink cluster deployment strategy

sidhant gupta
Hi all,

I'm fairly new to Flink cluster deployment. I wanted to know which Flink cluster deployment and which job mode on AWS is better in terms of ease of deployment, maintenance, HA, cost, etc. As of now, I am considering AWS EMR vs. ECS (Docker containers). We have a use case of setting up a data streaming API which reads records from a Kafka topic, processes them, and then writes to another Kafka topic. Please let me know your thoughts on this.

Thanks 
Sidhant Gupta


Re: Flink cluster deployment strategy

Till Rohrmann
Hi Sidhant,

I am not an expert on AWS services but I believe that EMR might be a bit easier to start with since AWS EMR comes with Flink support out of the box [1]. On ECS I believe that you would have to set up the containers yourself. Another interesting deployment option could be to use Flink's native Kubernetes integration [2] which would work on AWS EKS.


Cheers,
Till

On Tue, Aug 11, 2020 at 9:16 AM sidhant gupta <[hidden email]> wrote:
Re: Flink cluster deployment strategy

sidhant gupta
Hi Till,

Thanks for your response.
I have a few queries, though, as listed below:
(1) Can Flink be used in map-reduce fashion with the data streaming API?
(2) Does it make sense to use AWS EMR if we are not using Flink in map-reduce fashion with the streaming API?
(3) Can a Flink cluster be auto-scaled using EMR Managed Scaling when used with YARN, as per this link: https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/ ?
(4) Suppose we set an explicit max parallelism, set the current parallelism (which might be less than the max parallelism) equal to the maximum number of slots, and set the slots per task manager while starting the YARN session. If auto scaling then adds task managers, would the parallelism increase (up to the max parallelism), and would the load be distributed across the newly spun-up task managers? Refer: https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/production_ready.html#set-an-explicit-max-parallelism

Regards 
Sidhant Gupta 

On Tue, 11 Aug, 2020, 5:19 PM Till Rohrmann, <[hidden email]> wrote:
Re: Flink cluster deployment strategy

Till Rohrmann
Hi Sidhant,

See the inline comments for answers.

On Tue, Aug 11, 2020 at 3:10 PM sidhant gupta <[hidden email]> wrote:
(1) Can Flink be used in map-reduce fashion with the data streaming API?

What do you mean by map-reduce fashion? You can use Flink's DataSet API for processing batch workloads (consisting not only of map and reduce operations but also of other operations such as groupReduce, flatMap, etc.). Flink's DataStream API can be used to process bounded and unbounded streaming data.
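To make the streaming side of this answer concrete, here is a minimal plain-Python model of a map step followed by a per-key rolling reduce, which is how a reduce on a keyed stream behaves: every incoming record updates the aggregate for its key and the updated value is emitted right away, because an unbounded stream has no "end of input" to wait for. This is only a sketch under stated assumptions: it does not use Flink, and the names `keyed_rolling_reduce`, `records`, and `reduce_fn` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def keyed_rolling_reduce(
    records: Iterable[Tuple[K, V]],
    reduce_fn: Callable[[V, V], V],
) -> Iterator[Tuple[K, V]]:
    """Yield the updated aggregate for a key each time a record for that key arrives."""
    state: Dict[K, V] = {}  # one running aggregate per key (hypothetical stand-in for keyed state)
    for key, value in records:
        if key in state:
            state[key] = reduce_fn(state[key], value)
        else:
            state[key] = value  # the first record for a key is emitted unchanged
        yield key, state[key]


if __name__ == "__main__":
    # "Map" stage: turn raw words into (word, 1) pairs.
    words = ["a", "b", "a", "a", "b"]
    pairs = ((word, 1) for word in words)
    # "Reduce" stage: a running count per word, emitted record by record.
    for key, count in keyed_rolling_reduce(pairs, lambda x, y: x + y):
        print(key, count)
```

Because `records` can be any iterable, including a generator that never ends, the same function covers both the bounded and the unbounded case.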

(2) Does it make sense to use AWS EMR if we are not using Flink in map-reduce fashion with the streaming API?

I don't think I fully understand what you mean by map-reduce fashion. Do you mean multiple stages of map and reduce operations?
 
(3) Can a Flink cluster be auto-scaled using EMR Managed Scaling when used with YARN, as per this link: https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/ ?

I am no expert on EMR managed scaling, but I believe that it would need some custom tooling to scale a Flink job down (by taking a savepoint and resuming from it with a lower parallelism) before downsizing the EMR cluster.
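The savepoint-and-resume step mentioned above could be sketched as a small helper that only builds the two Flink CLI invocations without running them. This is a rough sketch, not a tested tool: it assumes the `stop -p <savepointDir> <jobId>` and `run -d -s <savepointPath> -p <parallelism> <jar>` forms of the Flink CLI, which should be checked against the Flink version in use. The function names, the `bin/flink` path, and all IDs and paths in the usage are hypothetical, and YARN-specific options are left out.

```python
from typing import List


def stop_with_savepoint_cmd(job_id: str, savepoint_dir: str, flink_bin: str = "bin/flink") -> List[str]:
    """Build the command that stops a job and writes a savepoint into savepoint_dir."""
    return [flink_bin, "stop", "-p", savepoint_dir, job_id]


def resume_cmd(savepoint_path: str, parallelism: int, job_jar: str, flink_bin: str = "bin/flink") -> List[str]:
    """Build the command that resubmits the job from a savepoint with a new parallelism."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [flink_bin, "run", "-d", "-s", savepoint_path, "-p", str(parallelism), job_jar]


if __name__ == "__main__":
    # Hypothetical job ID, bucket, savepoint path, and jar name, for illustration only.
    print(" ".join(stop_with_savepoint_cmd("<jobId>", "s3://my-bucket/savepoints")))
    print(" ".join(resume_cmd("s3://my-bucket/savepoints/<savepointPath>", 2, "my-job.jar")))
```

Custom tooling would run the first command, read the savepoint path that the CLI reports, run the second command with the lower parallelism, and only then let the cluster shrink.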
 
(4) Suppose we set an explicit max parallelism, set the current parallelism (which might be less than the max parallelism) equal to the maximum number of slots, and set the slots per task manager while starting the YARN session. If auto scaling then adds task managers, would the parallelism increase (up to the max parallelism), and would the load be distributed across the newly spun-up task managers? Refer: https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/production_ready.html#set-an-explicit-max-parallelism

At the moment, Flink does not support this out of the box but the community is working on this feature.
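As background on why the max parallelism is an upper bound for any later rescaling, the following plain-Python sketch models how keyed state is split into key groups and how those groups are spread over parallel subtasks. It assumes the mapping `key_group * parallelism // max_parallelism`, which is my understanding of Flink's key-group assignment; the hashing of keys into key groups is omitted, and the function names are hypothetical.

```python
from typing import Dict, List


def operator_index(key_group: int, max_parallelism: int, parallelism: int) -> int:
    """Return the index of the parallel subtask that owns the given key group."""
    if not 0 < parallelism <= max_parallelism:
        raise ValueError("parallelism must be between 1 and max_parallelism")
    if not 0 <= key_group < max_parallelism:
        raise ValueError("key_group must be in [0, max_parallelism)")
    return key_group * parallelism // max_parallelism


def key_groups_per_subtask(max_parallelism: int, parallelism: int) -> Dict[int, List[int]]:
    """List the key groups each subtask owns; the number of key groups equals max_parallelism."""
    assignment: Dict[int, List[int]] = {i: [] for i in range(parallelism)}
    for key_group in range(max_parallelism):
        assignment[operator_index(key_group, max_parallelism, parallelism)].append(key_group)
    return assignment


if __name__ == "__main__":
    # Same job, same 8 key groups, restored with different parallelism values.
    for parallelism in (2, 3, 4):
        print(parallelism, key_groups_per_subtask(8, parallelism))
```

In this model, changing the parallelism only changes which subtask owns each key group, so the load can spread over more task managers after a restart with a higher parallelism, but never over more subtasks than there are key groups.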

Re: Flink cluster deployment strategy

Arvid Heise-3
Hi Sidhant,

If you are starting fresh with Flink, I strongly recommend skipping ECS and EMR and going directly to a Kubernetes-based solution. Scaling is much easier on K8s, there will be some kind of autoscaling coming in the next release, and best of all, you even have the option to move to a different cloud provider if needed.

The easiest option for you is to use EKS on AWS together with Ververica community edition [1] or with one of the many Kubernetes operators.


On Tue, Aug 11, 2020 at 3:23 PM Till Rohrmann <[hidden email]> wrote:
--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   
Re: Flink cluster deployment strategy

sidhant gupta
Thanks, I will check it out. 

On Thu, 13 Aug, 2020, 7:55 PM Arvid Heise, <[hidden email]> wrote: