(DEPRECATED) Apache Flink User Mailing List archive.

Regarding Stateful Functions

Jessy Ping

Regarding Stateful Functions

Hi all,

I have gone through the Stateful Functions documentation and would appreciate some expert advice or clarification on the following points.

Note: My data processing flow is as follows:

ingress (10k/s) --> first transformation based on certain static rules --> second transformation based on certain dynamic rules --> third and final transformation based on certain dynamic and static rules --> egress

Questions

1. Is Stateful Functions a good candidate for a system (as above) that must process incoming requests at a rate of 10k/s according to various dynamic and static rules?

2. Is Flink capable of holding the above-mentioned dynamic rules in its state (about 1500 rules per keyed event) for faster transformation of incoming streams?

3. If we are not interested in using AWS Lambda or Azure Functions, what are the other options? What about using co-located functions and embedded functions? Is there any benefit in using one over the other for my data processing flow?

4. If we go with embedded or co-located functions, is it possible to autoscale the application using the recently released reactive mode in Flink 1.13?

Thanks

Jessy

austin.ce

Re: Regarding Stateful Functions

Hey Jessy,

I'm not a Statefun expert but, hopefully, I can point you in the right direction for some of your questions. I'll also cc Gordon, who helps to maintain Statefun.

1. Is Stateful Functions a good candidate for a system (as above) that must process incoming requests at a rate of 10k/s according to various dynamic and static rules?

That scale is definitely manageable in a Statefun cluster, and Statefun could well be a good fit for dynamic and static rules. Hopefully Gordon can comment more there. For the general Flink solution to this problem, I always turn to this great series of blog posts on fraud detection with dynamic rules[1].

2. Is Flink capable of holding the above-mentioned dynamic rules in its state (about 1500 rules per keyed event) for faster transformation of incoming streams?

This may be manageable as well, depending on how you are applying these rules and what they look like (size, etc.). Can you give any more information there?
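To make the question concrete, here is a minimal sketch of the three-stage flow with per-key dynamic rules. It is plain Python and does not use Flink or the Stateful Functions SDK; every name in it (`Rule`, `KeyedRuleStore`, `process`, the example rules) is hypothetical, and the in-memory dictionary only stands in for whatever keyed state the real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from
# Flink or the Stateful Functions SDK.
Event = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    rule_id: str
    apply: Callable[[Event], Event]


class KeyedRuleStore:
    """In-memory stand-in for per-key state holding dynamic rules."""

    def __init__(self) -> None:
        self._rules: Dict[str, Dict[str, Rule]] = {}

    def upsert(self, key: str, rule: Rule) -> None:
        # A control message adds or replaces one rule for one key.
        self._rules.setdefault(key, {})[rule.rule_id] = rule

    def remove(self, key: str, rule_id: str) -> None:
        # Removing an unknown rule or key is a no-op.
        self._rules.get(key, {}).pop(rule_id, None)

    def rules_for(self, key: str) -> List[Rule]:
        # Rules are returned in the order they were first inserted.
        return list(self._rules.get(key, {}).values())


# Assumed example of a static rule: it is fixed at startup and shared by all keys.
STATIC_RULES: List[Rule] = [
    Rule("to-cents", lambda e: {**e, "amount": e["amount"] * 100}),
]


def process(key: str, event: Event, store: KeyedRuleStore) -> Event:
    # Stage 1: transformation based on static rules.
    for rule in STATIC_RULES:
        event = rule.apply(event)
    # Stage 2: transformation based on the dynamic rules stored for this key.
    dynamic_rules = store.rules_for(key)
    for rule in dynamic_rules:
        event = rule.apply(event)
    # Stage 3: final step that depends on both rule sets; here it only
    # records how many rules of each kind were applied.
    return {
        **event,
        "static_applied": len(STATIC_RULES),
        "dynamic_applied": len(dynamic_rules),
    }


store = KeyedRuleStore()
store.upsert("user-1", Rule("double", lambda e: {**e, "amount": e["amount"] * 2}))
result = process("user-1", {"amount": 1.5}, store)
```

As a rough upper bound, if all 1500 rules had to be evaluated for every event at 10k events/s, that would be 10,000 x 1,500 = 15 million rule evaluations per second, which is why the size and shape of the rules matter here.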

3. If we are not interested in using AWS Lambda or Azure Functions, what are the other options? What about using co-located functions and embedded functions? Is there any benefit in using one over the other for my data processing flow?

Yes, you can embed JVM functions via Embedded Modules[2], which in your case might benefit from the Flink DataStream integration[3]. You can also host remote functions anywhere, e.g. on Kubernetes, behind an NGINX server, etc. The Module Configuration section[4] will likely shed more light on what is available. I think the main tradeoffs here are availability, scalability, and network latency for external functions.

4. If we go with embedded or co-located functions, is it possible to autoscale the application using the recently released reactive mode in Flink 1.13?

Statefun 3.0 uses Flink 1.12 but is expected to upgrade to Flink 1.13 in the next release cycle. A few other changes are also needed for compatibility with Reactive Mode (for example, making the Statefun cluster a regular Flink application, tracked in FLINK-16930 [5]), but it's coming!

On a higher-level note, what made you interested in Statefun for this use case? The community is currently trying to expand our understanding of potential users, so it would be great to hear a bit more!

Best,

Austin

[1]: https://flink.apache.org/news/2020/01/15/demo-fraud-detection.html

[2]: https://ci.apache.org/projects/flink/flink-statefun-docs-release-3.0/docs/deployment/embedded/#embedded-module-configuration

[3]: https://ci.apache.org/projects/flink/flink-statefun-docs-release-3.0/docs/sdk/flink-datastream/

[4]: https://ci.apache.org/projects/flink/flink-statefun-docs-release-3.0/docs/deployment/module/#module-configuration

[5]: https://issues.apache.org/jira/browse/FLINK-16930

(In reply to Jessy Ping's message of Wed, May 12, 2021 at 11:53 AM.)

Jessy Ping

Re: Regarding Stateful Functions

Hi Austin,

Thanks for your insights.

We currently use a microservice architecture to meet our data processing requirements. We are planning to use Flink as our unified platform for all data processing tasks. Although most of our use cases are a good fit for Flink, there is one use case that requires a deeper look into Flink's capabilities.

As I mentioned in my previous email, the processing flow of the use case under discussion is as follows:

ingress (>=10k/s) --> first transformation based on certain static rules --> second transformation based on certain dynamic rules --> third and final transformation based on certain dynamic and static rules --> egress

In our current design, we use a Hazelcast cluster embedded in our microservices. It's a complex system with several stability issues. We are looking for an open-source alternative, and Stateful Functions, powered by Flink, seems like an ideal candidate. The following features of Stateful Functions attracted us:

1. Consistent state

2. No database required

3. Exactly-once semantics

4. Logical addressing

5. Multi-language support

Any additional insights on the questions above would be helpful.

Thanks

Jessy

(In reply to Austin Cawley-Edwards' message of Thu, 13 May 2021 at 04:25.)