Flink: Clarification required


Flink: Clarification required

Jessy Ping

Hi all,


Currently, we are exploring the various features of Flink and need some clarification on the below-mentioned questions.

  • I have a stateless Flink application where the source and sink are two different Kafka topics. Is there any benefit in adding checkpointing to this application? Will it help in some way with rewind and replay when restarting after a failure?

  • I have a stateful use case where events are processed based on a set of dynamic rules provided by an external system, say a Kafka source. The actual events are distinguishable by a key. A broadcast function is used to broadcast the dynamic rules and store them in Flink state.

    So my question is: is processing the incoming streams based on these rules, stored in Flink state per key, efficient (I am using RocksDB as the state backend)?

    What about using an external cache for this?

    Is Stateful Functions a good contender here?

  • Is there any benefit in using Apache Camel along with Flink?


Thanks
Jessy

Re: Flink: Clarification required

Dawid Wysakowicz-2

Hi Jessy,

I have a stateless Flink application where the source and sink are two different Kafka topics. Is there any benefit in adding checkpointing to this application? Will it help in some way with rewind and replay when restarting after a failure?

If you want to make sure that your application has either AT_LEAST_ONCE or EXACTLY_ONCE semantics [1], you need to enable checkpointing. Flink keeps track of the Kafka offsets, which it stores in its state, to achieve those guarantees. Therefore, even though your transformations do not themselves have state, the source does.
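The point about source state can be sketched with a toy simulation in plain Python (this is not the Flink API; all names here are illustrative): the consumed offset is the state being checkpointed, and restoring it after a failure is what makes rewind and replay possible, at the cost of possible duplicates (at-least-once).

```python
# Toy simulation: a "stateless" Kafka-to-Kafka job still has state,
# namely the source offset, and checkpointing that offset is what
# enables replay after a failure.

class TopicSource:
    def __init__(self, records):
        self.records = records
        self.offset = 0  # source state: next offset to read

    def poll(self):
        rec = self.records[self.offset]
        self.offset += 1
        return rec

def run(source, sink, checkpoint, fail_at=None):
    """Process all records; on failure, restore the offset from the checkpoint."""
    processed = 0
    while source.offset < len(source.records):
        if fail_at is not None and source.offset == fail_at:
            source.offset = checkpoint["offset"]  # restart: rewind to last checkpoint
            fail_at = None
            continue
        sink.append(source.poll().upper())        # the stateless transformation
        processed += 1
        if processed % 2 == 0:                    # periodic checkpoint of source state
            checkpoint["offset"] = source.offset

src = TopicSource(["a", "b", "c", "d"])
out, ckpt = [], {"offset": 0}
run(src, out, ckpt, fail_at=3)
# The failure at offset 3 rewinds to the last checkpoint (offset 2), so "c"
# is reprocessed: at-least-once semantics, i.e. a duplicate in the sink.
print(out)  # ['A', 'B', 'C', 'C', 'D']
```

Without the checkpoint, the only options after a failure would be restarting from the beginning of the topic or from the latest offset, i.e. reprocessing everything or losing records.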

So my question is: is processing the incoming streams based on these rules, stored in Flink state per key, efficient (I am using RocksDB as the state backend)?

There is no single good answer to that question; it varies a lot depending on the volume of data, etc. I'd recommend testing it yourself to see whether you are happy with the performance. It is definitely an approach that many people have implemented and been happy with. There is also a blog post (rather old by now) which describes how you could implement such a pattern [2].
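The shape of the pattern from [2] can be sketched in plain Python (again, not the Flink API; in a real job this would be a KeyedBroadcastProcessFunction, with the rules in broadcast MapState and the per-key counters in keyed state backed by RocksDB):

```python
# Minimal sketch of the broadcast-rules pattern: rule updates go into a
# shared "broadcast state" visible to all keys, and each keyed event is
# evaluated against the current rule set plus its own per-key state.

broadcast_state = {}  # rule_id -> threshold; shared across all keys
keyed_counts = {}     # per-key running sum; keyed state in a real Flink job

def on_rule(rule_id, threshold):
    """Handle a dynamic rule update arriving from the rules stream."""
    broadcast_state[rule_id] = threshold

def on_event(key, value):
    """Update the key's state and return the ids of all violated rules."""
    keyed_counts[key] = keyed_counts.get(key, 0) + value
    return [rid for rid, limit in broadcast_state.items()
            if keyed_counts[key] > limit]

on_rule("r1", 10)
on_rule("r2", 25)
fired = []
for key, value in [("user-a", 8), ("user-a", 5), ("user-b", 30)]:
    fired.append(on_event(key, value))
print(fired)  # user-a trips r1 on its second event; user-b trips both rules
```

Because the rules live in Flink state rather than an external cache, rule lookups are local to the task and survive restarts via checkpoints, which is usually the argument for this approach over an external store.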

Is there any benefit in using Apache Camel along with Flink?

I am not very familiar with Apache Camel, so I can't say much about this. As far as I know, Apache Camel is more of a routing system, whereas Flink is a data processing framework.

Best,

Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/#kafka-producers-and-fault-tolerance

[2] https://flink.apache.org/2019/06/26/broadcast-state.html

On 10/05/2021 11:13, Jessy Ping wrote:

