Sudden checkpoint failure even though state size is only around 10 MB


Sushant Sawant
Hi all,
I'm facing two issues which I believe are correlated.
1. Kafka source shows high back pressure.
2. Sudden checkpoint failures for the entire day until a restart.

My job does the following (a rough sketch is shown below):
a. Read from Kafka
b. Async I/O to an external system
c. Write to Cassandra and Elasticsearch
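
For reference, a minimal sketch of a pipeline shaped like that (not the actual job): the topic, Kafka properties, timeout/capacity values are placeholders, the external lookup is a dummy that completes immediately, and a print sink stands in for the Cassandra/Elasticsearch sinks.

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class PipelineSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a. Read from Kafka; topic, servers and group id are placeholders.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "pipeline-sketch");
        DataStream<String> source = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // b. Async I/O to the external system: 30 s timeout, at most 100 in-flight requests.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                source, new ExternalLookup(), 30, TimeUnit.SECONDS, 100);

        // c. The real job fans out to Cassandra and Elasticsearch sinks;
        //    a print sink stands in here to keep the sketch self-contained.
        enriched.print();

        env.execute("kafka-async-sketch");
    }

    /** Stand-in for the real external call; completes immediately with a dummy result. */
    private static class ExternalLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> input + " (looked up)")
                    .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }
    }
}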

Checkpointing uses the filesystem state backend.
This Flink job has been proven under high load, around 5000 events/sec throughput.
But recently we scaled down the parallelism since there wasn't any load in production, and that is when these issues started.
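
For context (not from the original post), a filesystem-backed checkpoint setup usually looks roughly like this; the interval, mode, and checkpoint path are illustrative placeholders rather than the job's actual settings.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FsCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 s with exactly-once semantics (illustrative values).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep checkpoint data on a shared filesystem; the URI is a placeholder.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
    }
}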

Please find the status shown by the Flink dashboard.
The GitHub folder contains images from when there was high back pressure and checkpoint failure, as well as "everything is fine" images taken after the restart.

--
Could anyone point me in the direction of what might have gone wrong, or how to troubleshoot this?


Thanks & Regards,
Sushant Sawant

Re: Sudden checkpoint failure even though state size is only around 10 MB

Biao Liu
Hi Sushant,

Your screenshot shows that the checkpoint expired, which means it did not finish in time.
I guess the reason is that the heavy back pressure blocks both the data and the checkpoint barriers, but I can't tell why there was such heavy back pressure.
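
(For reference, not from the reply above: "expired" corresponds to the checkpoint timeout in CheckpointConfig. A small sketch of the relevant knobs follows, with illustrative values; raising the timeout only buys the checkpoint more time and does not remove the back pressure itself.)

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        CheckpointConfig cc = env.getCheckpointConfig();

        // A checkpoint that has not completed within this window is declared expired.
        cc.setCheckpointTimeout(10 * 60_000);

        // Leave some recovery time between consecutive checkpoint attempts.
        cc.setMinPauseBetweenCheckpoints(30_000);
    }
}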

If this scenario happens again, you could pay more attention to the tasks that cause the heavy back pressure.
The TaskManager log, the GC log, and other tools like jstack might help.

Thanks,
Biao /'bɪ.aʊ/


