StateFun scalability

Martijn de Heus
Hi all,

I’ve been working with StateFun for a while for my university project. I am now trying to increase the number of StateFun workers and the parallelism, but this barely seems to increase the throughput of my system.

I have 5000 function instances in my system during my tests. When I increase the number of workers from 1 to 3, I see a significant increase in throughput, but from 3 to 5 (or even to 7) I see no increase at all. Each worker runs with 4 CPUs, and I have made sure that neither Kafka nor my deployed co-located functions are a bottleneck. I also have many partitions for the ingress topics.

I have attached my flink-conf.yaml below. Is this expected behaviour for StateFun, or am I missing some configuration that could improve performance? And if this is expected for StateFun, what could be causing it?

Best regards,

Martijn


jobmanager.rpc.address: statefun-master
taskmanager.numberOfTaskSlots: 1
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
state.checkpoints.dir: file:///checkpoint-dir
state.backend: rocksdb
state.backend.rocksdb.timer-service.factory: ROCKSDB
state.backend.incremental: true
execution.checkpointing.interval: 10sec
execution.checkpointing.mode: EXACTLY_ONCE
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 1sec
jobmanager.memory.process.size: 1g
taskmanager.memory.process.size: 1g
parallelism.default: 5
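
A note on the two settings above that bound parallelism: with taskmanager.numberOfTaskSlots: 1, each worker offers a single slot, so parallelism.default: 5 can occupy at most five workers. The sketch below is only illustrative arithmetic, not something StateFun or Flink runs. It assumes Flink's default slot sharing (a job needs as many slots as its parallelism) and the default scheduler; the function name slot_usage is made up for this example. Whether parallelism.default was raised for the 7-worker test is not stated in the thread, so this is something to rule out rather than a diagnosis.

```python
def slot_usage(workers, slots_per_worker, parallelism):
    """Return (used_slots, idle_slots) for one job.

    Assumption: with slot sharing, a job of parallelism p needs p slots;
    any slots beyond that stay idle for this job.
    """
    total_slots = workers * slots_per_worker
    if parallelism > total_slots:
        # Assumption: the job cannot be scheduled without enough slots.
        raise ValueError(
            f"parallelism {parallelism} exceeds {total_slots} available slots"
        )
    return parallelism, total_slots - parallelism


# Values taken from the config above: 1 slot per worker, parallelism 5.
for workers in (5, 7):
    used, idle = slot_usage(workers, slots_per_worker=1, parallelism=5)
    print(f"{workers} workers: {used} slots used, {idle} idle")
```

Under these assumptions, adding workers beyond the configured parallelism adds no processing capacity to the job unless parallelism.default (or the number of slots per worker) is raised as well.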

Re: StateFun scalability

Igal Shilman
Hello Martijn,

Great to hear that you are exploring StateFun as part of your university project!

Can you please clarify:
- How do you measure throughput?
- By co-located functions, do you mean a remote function on the same machine?
- Can you share a little more about your functions? What are they doing?
- Do you use any kind of state?
- What kind of messages do you send? Are you using Protobuf for messages, or something else?

Can you validate your setup against a vanilla Flink program (something like a word count)?

Thanks,
Igal


On Thu, Feb 4, 2021 at 9:51 PM Martijn de Heus <[hidden email]> wrote:
[...]