Re: Very slow recovery from Savepoint
Posted by rmetzger0
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Very-slow-recovery-from-Savepoint-tp41081p41204.html
Hey Yordan,
have you checked the log files from the processes in that cluster?
The JobManager log should give you hints about issues with the coordination / scheduling of the job. Could it be something unexpected, like the job not being able to start because there were not enough TaskManagers available?
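If it helps, here is a minimal sketch for checking that quickly through the JobManager's REST API (assuming it is reachable on localhost:8081, e.g. after a kubectl port-forward to the JobManager pod; the base URL and expected TaskManager count are placeholders, and the exact response fields may differ slightly between versions):

import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8081"   # assumed JobManager REST address
EXPECTED_TASKMANAGERS = 4            # from the setup described below

def get(path):
    with urlopen(BASE_URL + path) as resp:
        return json.load(resp)

# 1) Did all TaskManagers register with the JobManager?
taskmanagers = get("/taskmanagers")["taskmanagers"]
print(f"registered TaskManagers: {len(taskmanagers)} / {EXPECTED_TASKMANAGERS}")
for tm in taskmanagers:
    print(f"  {tm['id']}: {tm['slotsNumber']} slots, {tm['freeSlots']} free")

# 2) What state is the job in, and what do the checkpoint stats report?
for job in get("/jobs/overview")["jobs"]:
    job_id, state = job["jid"], job["state"]
    print(f"job {job_id}: {state}")
    checkpoints = get(f"/jobs/{job_id}/checkpoints")
    restored = (checkpoints.get("latest") or {}).get("restored")
    if restored:
        print(f"  restored from: {restored.get('external_path')}")
    print(f"  completed checkpoints: {checkpoints['counts']['completed']}")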
The TaskManager logs could also give you hints about potential retries etc.
What you could also do is manually sample the TaskManagers (you can access thread dumps via the web UI) to see what they are doing.
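If you want to sample repeatedly from the command line instead of the web UI, here is a rough sketch (assuming kubectl access to the cluster, jstack being available in the TaskManager image, the JVM running as PID 1, and a placeholder pod name):

import subprocess
import time

POD = "flink-taskmanager-0"   # placeholder pod name
SAMPLES = 10                  # number of thread dumps to take
INTERVAL_SECONDS = 30         # time between samples

for i in range(SAMPLES):
    # Take a thread dump of the TaskManager JVM inside the pod.
    dump = subprocess.run(
        ["kubectl", "exec", POD, "--", "jstack", "1"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(f"threaddump-{POD}-{i}.txt", "w") as f:
        f.write(dump)
    print(f"saved sample {i + 1}/{SAMPLES}")
    time.sleep(INTERVAL_SECONDS)

Grepping the RUNNABLE stacks across the samples (e.g. for S3 download or RocksDB restore frames) usually shows where the TaskManagers spend their time.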
Hope this helps!
Hello there,
I am trying to find the solution to a problem we are having in our Flink
setup: very slow recovery from a savepoint. I have searched the mailing
list and found a somewhat similar problem, but the bottleneck there was
disk usage, which I am not seeing in our case. Here is a description of
our setup:
* Flink 1.11.3
* Running on top of Kubernetes on dedicated hardware.
* The Flink job consists of 4 TaskManagers running on separate Kubernetes
pods, along with a JobManager also running on a separate pod.
* We use RocksDB state backend with incremental checkpointing.
* The size of the savepoint I am trying to recover from is around 35 GB
* The file system that RocksDB uses is S3, or more precisely an S3
emulation (MinIO), so we are not subject to any EBS burst credits and
the like.
The time it takes for the Flink job to become operational and start consuming
new records is around 5 hours. During that time I am not seeing any heavy
resource usage on any of the TaskManager pods. I am attaching a
screenshot of the resource usage of one of the TaskManager pods.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>
In this graph the job was started at around 14:00. There is a huge
spike shortly after that, and then not much happening. This goes on for
around 5 hours, after which the job starts, but it still works quite
slowly. What would be the way to profile where the bottleneck is? I
have checked my network connectivity, and I am able to download the
whole savepoint manually in a few minutes. It seems like Flink is very
slow to build its internal state, but then again the CPU is not being
utilized. I would be grateful for any suggestions on how to proceed
with this investigation.
Regards,
Yordan