You may want to look at using instances with local SSD drives. You don’t really need to keep the state data between instance stops and starts, because Flink will have to restore from a checkpoint or savepoint anyway, so using ephemeral storage isn’t a problem.
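For what it's worth, here is a rough sketch of the flink-conf.yaml settings this implies. It assumes the instance's local SSD is mounted into the TaskManager pod at /mnt/local-ssd and that checkpoints go to an S3 bucket. The mount path and bucket name are placeholders I made up, not anything from your setup.

```yaml
# Sketch only: the path and bucket below are hypothetical placeholders.
state.backend: rocksdb
# RocksDB working files live on the ephemeral local SSD (assumed mount point).
state.backend.rocksdb.localdir: /mnt/local-ssd/rocksdb
# Durable checkpoints go to remote storage, so losing the local disk is fine.
state.checkpoints.dir: s3://my-flink-bucket/checkpoints
# Incremental checkpoints upload only changed RocksDB files.
state.backend.incremental: true
```

The point is that only the checkpoint directory needs to survive an instance stop. The RocksDB local directory is rebuilt when the job restores.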
Hello all,
I am trying to provision a Flink cluster on k8s. Some of the jobs in our existing cluster use the RocksDB state backend. I would like to look at a Flink Helm chart or deployment manifests that provision task managers with dynamic PVs, and to see how they manage them. We are running on a kops-managed k8s cluster on AWS (not EKS). Also, some pointers on expected pain points, surprises, and monitoring strategies would be really helpful.