Very slow recovery from Savepoint

Very slow recovery from Savepoint

Yordan Pavlov
Hello there,
I am trying to find a solution to a problem we are having in our Flink
setup: very slow recovery from a savepoint. I searched the mailing list and
found a somewhat similar problem, but the bottleneck there was hard-disk
usage, which I am not seeing in our case. Here is a description of our setup:
* Flink 1.11.3
* Running on top of Kubernetes on dedicated hardware.
* The Flink job consists of 4 TaskManagers running on separate Kubernetes
pods, along with a JobManager, also running on its own pod.
* We use the RocksDB state backend with incremental checkpointing.
* The size of the savepoint I am trying to recover is around 35 GB.
* The file system that RocksDB uses is S3, or more precisely an S3
emulation (Minio); we are not subject to EBS burst credits and so on.
(A configuration sketch of this setup follows the list.)
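
For reference, the state backend wiring looks roughly like the following (a minimal sketch; the bucket name and checkpoint interval are placeholders, not our real values):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend with incremental checkpointing enabled;
            // checkpoints and savepoints go to the Minio-backed S3 bucket
            // (bucket name is a placeholder).
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://flink-state/checkpoints", true);
            env.setStateBackend(backend);

            // Checkpoint every 5 minutes (interval is a placeholder).
            env.enableCheckpointing(5 * 60 * 1000L);

            // The Minio endpoint and credentials themselves are configured in
            // flink-conf.yaml (e.g. s3.endpoint), not in code.

            // ... build the job graph and call env.execute(...) here ...
        }
    }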

The time it takes for the Flink job to become operational and start consuming
new records is around 5 hours. During that time I do not see any heavy
resource usage on any of the TaskManager pods. I am attaching a
screenshot of the resource usage of one of the TaskManager pods.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2957/Flink-pod-start.png>

In this graph the job was started at around 14:00. There is a huge spike
shortly after that, and then not much happens. This goes on for around
5 hours, after which the job starts, but it still runs quite slowly.
What would be the right way to profile where the bottleneck is? I have
checked my network connectivity, and I can manually download the whole
savepoint in a matter of minutes. It seems like Flink is very slow to
rebuild its internal state, yet the CPU is not being utilized. I would
be grateful for any suggestions on how to proceed with this investigation.

Regards,
Yordan

Re: Very slow recovery from Savepoint

rmetzger0
Hey Yordan,

Have you checked the log files from the processes in that cluster?
The JobManager log should give you hints about issues with the coordination/scheduling of the job. Could it be something unexpected, like the job not being able to start because there were not enough TaskManagers available?
The TaskManager logs could also give you hints about potential retries, etc.

What you could also do is manually sample the TaskManagers (you can access thread dumps via the web UI) to see what they are doing.
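
If sampling through the UI is awkward, jstack against the TaskManager JVM inside the pod gives the same information. A comparable dump can also be produced with plain java.lang.management; a tiny illustrative sketch (it dumps its own JVM's threads, so it would have to run inside the TaskManager process to be useful):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumpSample {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // Print every thread with its full stack trace; during a slow
            // restore, look for threads stuck in S3/Minio I/O or in the
            // RocksDB state restore path.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.println("\"" + info.getThreadName()
                        + "\" state=" + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
                System.out.println();
            }
        }
    }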

Hope this helps!

Re: Very slow recovery from Savepoint

rmetzger0
Great to hear that you were able to resolve the issue!

On Thu, Feb 4, 2021 at 5:12 PM Yordan Pavlov <[hidden email]> wrote:
Thank you for your tips Robert,
I think I narrowed the problem down to slow hard disks. Once memory
runs out, RocksDB starts spilling to disk and performance degrades
badly. I moved the jobs to SSD disks and performance has been better.
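
In case it helps anyone else: what mattered for us was that RocksDB's local working directories end up on SSD-backed storage in the TaskManager pods. One way to point Flink at such a path explicitly (mount path and bucket name are placeholders, not necessarily exactly how we did it):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbOnSsd {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://flink-state/checkpoints", true);

            // Keep RocksDB's local SST/working files on the SSD-backed mount
            // (path is a placeholder); the same thing can be set via
            // state.backend.rocksdb.localdir in flink-conf.yaml.
            backend.setDbStoragePath("/mnt/ssd/flink/rocksdb");

            env.setStateBackend(backend);
        }
    }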

Best regards!
