Need Help on Flink suitability to our use case

Need Help on Flink suitability to our use case

Prasanna kumar
Hi,

I have the following use case to implement in my organization.

Say there is a huge relational database (1,000 tables for each of our 30k customers) in our monolith setup.

We want to reduce the load on the DB and prevent applications from hitting it for the latest events, so an extract is taken from the redo logs onto Kafka.

We need to set up a streaming platform based on the table updates that happen (read from Kafka): we need to form events and send them to the consumers.

Each consumer may be interested in the same table but in different updates/columns, depending on their business needs, and the events are then delivered to their endpoint: Kinesis, SQS, or a Kafka topic.

So the case here is 1 table update : m events : n sinks.
The expected peak load is easily 100k to a million table updates per second (all customers put together).
The latency expected by most customers is less than a second, mostly in the 100-500 ms range.
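
To make the shape concrete, here is roughly what I picture one such pipeline looking like in the Flink DataStream API. This is a sketch only: TableUpdate, Event, TableUpdateSchema, toEvents and the sink handles are invented names, and an SQS sink would have to be a custom SinkFunction since there is no official one.

    // Sketch of 1 table update -> m events -> n sinks (imports omitted).
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Source: the redo-log extract that lands on Kafka.
    DataStream<TableUpdate> updates = env.addSource(
        new FlinkKafkaConsumer<>("db-redo-log", new TableUpdateSchema(), kafkaProps));

    // Fan out: one update can produce one event per interested consumer.
    DataStream<Event> events = updates
        .flatMap((TableUpdate u, Collector<Event> out) -> {
            for (Event e : toEvents(u)) {   // business rules decide m
                out.collect(e);
            }
        })
        .returns(Event.class);              // needed because of type erasure

    // Route each event to its delivery channel: n sinks.
    events.filter(e -> e.channel.equals("kafka")).addSink(kafkaSink);
    events.filter(e -> e.channel.equals("kinesis")).addSink(kinesisSink);
    events.filter(e -> e.channel.equals("sqs")).addSink(sqsSink);  // custom sink

    env.execute("table-update-fanout");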

Is this use case suited for Flink?

I went through the Flink book and documentation. These are the questions I have:

1) If we have a situation like this (1 table update : m events : n sinks), is it better to write our own microservice or to implement it through Flink?
      1a) How does checkpointing happen if we have a 1 input : n outputs situation?
      1b) There are no heavy transformations; at most we might check that the required columns are present in the DB updates and decide whether to create an event. So there is an alternative thought of writing the service in Node.js, since the work is more IO than processing.

2) I see that we write a job, it is deployed, and Flink takes care of the rest: parallelism, latency, and throughput.
     But what I need is to write a generic framework so that we can handle any table structure; we should not end up writing one job driver for each case.
    There are at least 200 types of events in the existing monolith system that might move to this new system once it is built.
   
3) How do we keep the Flink cluster highly available? From the book, I gather that internal task-level failures are handled gracefully in Flink. But what if the whole Flink cluster goes down: how do we make sure it is HA?
    I had earlier worked with Spark and we had issues managing it. (It was not much of a problem there, since the latency requirement was 15 minutes and we could ramp another cluster up within that time.)
    These are absolute real-time cases and we cannot miss even one message/event.

There are also thoughts of using Kafka Streams or Apache Storm for the same. (They are being investigated by a different set of folks.)

Thanks,
Prasanna.

Re: Need Help on Flink suitability to our use case

rmetzger0
Hey Prasanna,

(Side note: there is no need to send this email to multiple mailing lists; the user@ list is the right one.)

Let me quickly go through your questions:

Is this use case suited for Flink?

Based on the details you've provided: yes.
What you also need to consider are the hardware requirements for processing such amounts of data. I strongly recommend setting up a small demo environment to measure the throughput of a smaller Flink cluster (say, 10 machines).

1) If you do not need any consistency guarantees (i.e., data loss is acceptable), and you have good infrastructure in place to deploy and monitor such microservices, then a microservice might also be an option.
Flink is well suited for heavy-IO use cases. AFAIK Netflix has talked at several Flink Forward conferences about similar cases (check YouTube for the recorded talks).
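
To touch on your question 1a: checkpointing does not depend on the fan-out. The checkpoint barriers flow through every operator of the job graph, so a 1 input : n outputs topology is snapshotted the same way as a 1 : 1 one. A minimal sketch (the intervals here are arbitrary):

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Consistent snapshots across the whole 1 : m : n dataflow.
    env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);

    // End-to-end exactly-once additionally needs transactional sinks,
    // e.g. FlinkKafkaProducer with Semantic.EXACTLY_ONCE.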

2) It should not be too difficult to build a small, generic framework on top of the Flink APIs.
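One common shape for this is a single rule-driven job: routing rules describe which table and columns produce which event, so a new event type becomes a configuration change rather than a new job. A hypothetical sketch (RoutingRule, TableUpdate, Event and loadRules are illustrative names, not Flink API; TableUpdate.columns is assumed to be a map from column name to value):

    // A rule: "if this table changed and these columns are present,
    // emit this event to this channel".
    public class RoutingRule implements Serializable {
        public String table;                // source table to match
        public Set<String> requiredColumns; // columns that must be present
        public String eventType;            // event to emit
        public String channel;              // delivery channel id
    }

    public class RuleBasedFanOut extends RichFlatMapFunction<TableUpdate, Event> {
        private transient List<RoutingRule> rules;

        @Override
        public void open(Configuration parameters) {
            rules = loadRules(); // e.g. from a config file or DB, not shown
        }

        @Override
        public void flatMap(TableUpdate update, Collector<Event> out) {
            for (RoutingRule rule : rules) {
                if (rule.table.equals(update.table)
                        && update.columns.keySet().containsAll(rule.requiredColumns)) {
                    out.collect(new Event(rule.eventType, rule.channel, update));
                }
            }
        }
    }

If the rules need to change while the job is running, Flink's broadcast state pattern lets you stream rule updates into the same operator instead of loading them once in open().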

3) If you are deploying Flink on a resource manager like Kubernetes or YARN, it will take care of recovering your cluster if it goes down. Your recovery time will mostly depend on the size of the state you are checkpointing (and on the ability of your resource manager to bring up new resources). I don't think you'll be able to recover in under 500 milliseconds, but recovery within a few seconds should be possible.
I don't think the other frameworks you are looking at are going to be much better at this.
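
For reference, with a standalone cluster the ZooKeeper-based HA setup boils down to a few entries in flink-conf.yaml. A sketch; the quorum addresses and storage paths are placeholders:

    # A standby JobManager takes over on leader failure; jobs resume
    # from the last completed checkpoint.
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.storageDir: s3://flink-ha/recovery

    state.backend: rocksdb
    state.checkpoints.dir: s3://flink-checkpoints/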

Best,
Robert


Re: Need Help on Flink suitability to our use case

Prasanna kumar
Thanks, Robert, for the reply.
