Hello there,
I am trying to find a solution to a problem we are having in our Flink setup related to very slow recovery from a savepoint. I have searched the mailing list and found a somewhat similar problem, but the bottleneck there was disk (HDD) usage, which I am not seeing in our case. Here is a description of our setup:

* Flink 1.11.3
* Running on top of Kubernetes on dedicated hardware.
* The Flink job consists of 4 TaskManagers running on separate Kubernetes pods, along with a JobManager also running on a separate pod.
* We use the RocksDB state backend with incremental checkpointing.
* The size of the savepoint I am trying to recover is around 35 GB.
* The file system that RocksDB uses is S3, or more precisely an S3 emulation (MinIO), so we are not subject to any EBS burst credits and the like.

The time it takes for the Flink job to become operational and start consuming new records is around 5 hours. During that time I am not seeing any heavy resource usage on any of the TaskManager pods. I am attaching a screenshot of the resources of one of the TaskManager pods:

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>

In this graph the job was started at around 14:00. There is a huge spike shortly after that, and then not much happening. This goes on for around 5 hours, after which the job starts, but again works quite slowly.

What would be the way to profile where the bottleneck is? I have checked my network connectivity and I am able to download the whole savepoint manually within several minutes. It seems like Flink is very slow to build its internal state, but then again the CPU is not being utilized. I would be grateful for any suggestions on how to proceed with this investigation.

Regards,
Yordan
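For reference, a setup like the one described above is typically wired up along these lines in Flink 1.11. This is only a minimal sketch: the bucket path, checkpoint interval, and job name are placeholders, and the MinIO endpoint itself would be configured via the S3 filesystem plugin settings (e.g. s3.endpoint in flink-conf.yaml) rather than in code.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbIncrementalCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled,
        // writing checkpoint/savepoint data to an S3(-compatible) bucket.
        // "s3://flink-state/checkpoints" is a placeholder path.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://flink-state/checkpoints", true);
        env.setStateBackend(backend);

        // Checkpoint every 60 seconds (placeholder interval).
        env.enableCheckpointing(60_000);

        // ... sources, transformations and sinks of the actual job go here ...

        env.execute("rocksdb-incremental-checkpointing-job");
    }
}
```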
Hey Yordan,

have you checked the log files from the processes in that cluster? The JobManager log should give you hints about issues with the coordination / scheduling of the job. Could it be something unexpected, for example that your job could not start because there were not enough TaskManagers available? The TaskManager logs could also give you hints about potential retries etc.

What you could also do is manually sample the TaskManagers (you can access thread dumps via the web UI) to see what they are doing.

Hope this helps!
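One way to do that manual sampling, as an alternative to the web UI, is to exec into a TaskManager pod and take repeated jstack dumps of the TaskManager JVM. Below is a rough sketch of such a poor-man's sampler; it assumes the JDK's jstack tool is available in the pod image and that the TaskManager PID is passed as the first argument (e.g. found via jps).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Poor-man's sampler: takes a jstack thread dump of the given PID every few
// seconds and writes each dump to its own file, so recurring stack frames
// (e.g. threads busy downloading or rebuilding RocksDB state) stand out.
public class ThreadDumpSampler {
    public static void main(String[] args) throws IOException, InterruptedException {
        String pid = args[0];         // TaskManager JVM PID
        int samples = 20;             // number of dumps to take
        long intervalMillis = 5_000;  // pause between dumps

        for (int i = 0; i < samples; i++) {
            Process jstack = new ProcessBuilder("jstack", pid)
                    .redirectErrorStream(true)
                    .start();
            Path out = Paths.get("threaddump-" + i + ".txt");
            Files.copy(jstack.getInputStream(), out);
            jstack.waitFor();
            Thread.sleep(intervalMillis);
        }
    }
}
```

Stack frames that show up in most of the samples usually point at the bottleneck more reliably than CPU or network graphs do.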
Great to hear that you were able to resolve the issue!