Sudden checkpoint failure even though state size is only around 10 MB


Sushant Sawant
Hi all,
I'm facing two issues which I believe are correlated.
1. Kafka source shows high back pressure.
2. Sudden checkpoint failures for the entire day until a restart.

My job does the following (a rough sketch is shown below):
a. Read from Kafka
b. Async I/O to an external system
c. Write to Cassandra and Elasticsearch
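
For reference, a minimal sketch of a pipeline shaped like that (not the actual job): the topic, Kafka properties, timeout/capacity values are placeholders, the external lookup is a dummy that completes immediately, and a print sink stands in for the Cassandra/Elasticsearch sinks.

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class PipelineSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // a. Read from Kafka; topic, servers and group id are placeholders.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "pipeline-sketch");
        DataStream<String> source = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // b. Async I/O to the external system: 30 s timeout, at most 100 in-flight requests.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
                source, new ExternalLookup(), 30, TimeUnit.SECONDS, 100);

        // c. The real job fans out to Cassandra and Elasticsearch sinks;
        //    a print sink stands in here to keep the sketch self-contained.
        enriched.print();

        env.execute("kafka-async-sketch");
    }

    /** Stand-in for the real external call; completes immediately with a dummy result. */
    private static class ExternalLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> input + " (looked up)")
                    .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }
    }
}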

Checkpointing uses the filesystem state backend.
This Flink job has been proven under high load, around 5000 events/sec throughput.
But recently we scaled down the parallelism since there wasn't any load in production, and that is when these issues started.
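
For context (not from the original post), a filesystem-backed checkpoint setup usually looks roughly like this; the interval, mode, and checkpoint path are illustrative placeholders rather than the job's actual settings.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FsCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 s with exactly-once semantics (illustrative values).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep checkpoint data on a shared filesystem; the URI is a placeholder.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
    }
}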

Please find the status shown by the Flink dashboard.
The GitHub folder contains images from when there was high back pressure and checkpoint failure, as well as "everything is fine" images taken after the restart.

--
Could anyone point me in the direction of what might have gone wrong, or how to troubleshoot this?


Thanks & Regards,
Sushant Sawant

Re: Sudden checkpoint failure even though state size is only around 10 MB

Biao Liu
Hi Sushant,

Your screenshot shows that the checkpoint expired, which means it did not finish in time.
I guess the reason is that the heavy back pressure blocks both the data and the checkpoint barriers, but I can't tell why there was such heavy back pressure.
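
(For reference, not from the reply above: "expired" corresponds to the checkpoint timeout in CheckpointConfig. A small sketch of the relevant knobs follows, with illustrative values; raising the timeout only buys the checkpoint more time and does not remove the back pressure itself.)

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        CheckpointConfig cc = env.getCheckpointConfig();

        // A checkpoint that has not completed within this window is declared expired.
        cc.setCheckpointTimeout(10 * 60_000);

        // Leave some recovery time between consecutive checkpoint attempts.
        cc.setMinPauseBetweenCheckpoints(30_000);
    }
}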

If this scenario happens again, you could pay more attention to the tasks that cause the heavy back pressure.
The TaskManager log, the GC log, and other tools like jstack might help.

Thanks,
Biao /'bɪ.aʊ/


