Re: Very slow recovery from Savepoint

Posted by rmetzger0 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Very-slow-recovery-from-Savepoint-tp41081p41301.html

Great to hear that you were able to resolve the issue!

On Thu, Feb 4, 2021 at 5:12 PM Yordan Pavlov <[hidden email]> wrote:
Thank you for your tips Robert,
I think I have narrowed the problem down to slow hard disks. Once
the memory runs out, RocksDB starts spilling to disk and
performance degrades greatly. I moved the jobs to SSD disks and
performance has been better since.
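
For reference, pointing RocksDB's local working directory at the faster
disks is a small change in flink-conf.yaml; the path below is only an
example and depends on how the SSD volume is mounted into the
TaskManager pods:

    # example mount point of an SSD volume inside the TaskManager pods
    state.backend.rocksdb.localdir: /mnt/ssd/rocksdb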

Best regards!

On Tue, 2 Feb 2021 at 20:22, Robert Metzger <[hidden email]> wrote:
>
> Hey Yordan,
>
> have you checked the log files from the processes in that cluster?
> The JobManager log should give you hints about issues with the coordination / scheduling of the job. Could it be something unexpected, like the job not being able to start because there were not enough TaskManagers available?
> The TaskManager logs could also give you hints about potential retries, etc.
>
> What you could also do is manually sample the TaskManagers (you can access thread dumps via the web UI) to see what they are doing.
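>
> If the web UI route is not convenient, something along these lines should also work, assuming the TaskManager JVM runs as PID 1 in the container and the image ships a JDK with jstack (the pod name below is just a placeholder):
>
>     # take a thread dump of the TaskManager JVM (PID 1) inside a pod
>     kubectl exec flink-taskmanager-0 -- jstack 1 > tm-threads.txt
>     # repeat a few times and check whether threads are busy in S3 I/O or RocksDB restore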
>
> Hope this helps!
>
> On Thu, Jan 28, 2021 at 5:42 PM Yordan Pavlov <[hidden email]> wrote:
>>
>> Hello there,
>> I am trying to find the solution to a problem we are having in our Flink
>> setup: very slow recovery from a savepoint. I searched the mailing list and
>> found a somewhat similar problem, but the bottleneck there was HD usage,
>> which I am not seeing in our case. Here is a description of our setup (a
>> rough sketch of the corresponding configuration follows the list):
>> * Flink 1.11.3
>> * Running on top of Kubernetes on dedicated hardware.
>> * The Flink job consists of 4 TaskManagers running on separate Kubernetes
>> pods, along with a JobManager also running on a separate pod.
>> * We use RocksDB state backend with incremental checkpointing.
>> * The size of the savepoint I am trying to recover is around 35 GB.
>> * The file system that RocksDB uses is S3, or more precisely an S3
>> emulation (MinIO), so we are not subject to any EBS burst credits and
>> the like.
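>>
>> A configuration roughly along these lines would match the setup above;
>> the bucket name and MinIO endpoint below are placeholders, not our
>> actual values:
>>
>>     state.backend: rocksdb
>>     state.backend.incremental: true
>>     state.checkpoints.dir: s3://flink-state/checkpoints
>>     state.savepoints.dir: s3://flink-state/savepoints
>>     # MinIO endpoint used by the flink-s3-fs plugin
>>     s3.endpoint: http://minio:9000
>>     s3.path.style.access: true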
>>
>> The time it takes for the Flink job to become operational and start consuming
>> new records is around 5 hours. During that time I am not seeing any heavy
>> resource usage on any of the TaskManager pods. I am attaching a
>> screenshot of the resources of one of the TaskManager pods.
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>
>>
>> In this graph the job was started at around 14:00. There is a huge
>> spike shortly after that, and then not much happens. This goes on for
>> around 5 hours, after which the job starts, but it again works quite
>> slowly. What would be the way to profile where the bottleneck is? I
>> have checked my network connectivity, and I am able to download the
>> whole savepoint manually in several minutes. It seems like Flink is
>> very slow to build its internal state, but then again the CPU is not
>> being utilized. I would be grateful for any suggestions on how to
>> proceed with this investigation.
>>
>> Regards,
>> Yordan