Re: checkpoint failure suddenly even state size less than 1 mb

Sushant Sawant (Tue, 27 Aug 2019, 15:01)
Hi team,
Could anyone help or offer a suggestion? We have now stopped all input to Kafka, so there is no processing and nothing is written to the sinks, yet checkpointing still fails.
Is it the case that once a checkpoint fails, checkpoints keep failing until the job is restarted?

Help appreciated.

Thanks & Regards,
Sushant Sawant

On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <[hidden email]> wrote:
Hi all,
I'm facing two issues, which I believe are correlated.
1. The Kafka source shows high back pressure.
2. Checkpoints suddenly start failing and keep failing for the entire day, until a restart.

My job does the following:
a. Reads from Kafka
b. Makes async I/O calls to an external system
c. Writes to Cassandra and Elasticsearch

Checkpointing uses the file system.
This Flink job has been proven under high load,
at a throughput of around 5000/sec.
But recently we scaled down the parallelism, since there wasn't any load in production, and these issues started.

Please find the status shown by the Flink dashboard.
The GitHub folder contains images from when there was high back pressure and checkpoint failure,
as well as "everything is fine" images taken after the restart.

--
Could anyone point me in the right direction as to what might have gone wrong, or how to troubleshoot this?


Thanks & Regards,
Sushant Sawant

Yun Tang (Fri, 30 Aug 2019, 12:48)
Hi Sushant

What confuses me is why the source task cannot complete the checkpoint within 3 minutes [1]. If no sub-task has ever completed the checkpoint, that means even the source task cannot complete it. The source task does not actually need to buffer any data. From what I see, it might be stuck trying to acquire the lock that is held by the stream task's main thread while it processes elements [2]. Could you use jstack to capture the threads of your Java process, so we can see what was happening when the checkpoint failed?
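The lock hypothesis above can be illustrated with a minimal, stdlib-only sketch. This is not Flink's code: the class `CheckpointLockSketch`, its methods, and the use of a `ReentrantLock` with a timed `tryLock` are all assumptions made for illustration, and the timeout merely stands in for the checkpoint timeout. It only shows that a checkpoint attempt cannot proceed, even with no input and tiny state, while another thread holds the shared lock inside a call that never returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class CheckpointLockSketch {
    // Hypothetical stand-in for a lock shared by element processing and checkpointing.
    private final ReentrantLock checkpointLock = new ReentrantLock();

    /** Attempts a "checkpoint": succeeds only if the lock is obtained within the timeout. */
    public boolean tryCheckpoint(long timeoutMillis) throws InterruptedException {
        if (!checkpointLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false; // lock still held elsewhere, the attempt times out
        }
        try {
            return true; // a real system would snapshot its state here
        } finally {
            checkpointLock.unlock();
        }
    }

    /** Starts a thread that holds the lock while "processing an element" until released. */
    public Thread startStuckProcessing(CountDownLatch lockHeld, CountDownLatch release) {
        Thread t = new Thread(() -> {
            checkpointLock.lock();
            try {
                lockHeld.countDown();   // signal that the lock is now held
                release.await();        // simulates a downstream call that does not return
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                checkpointLock.unlock();
            }
        }, "task-main-thread");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        CheckpointLockSketch sketch = new CheckpointLockSketch();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = sketch.startStuckProcessing(lockHeld, release);
        lockHeld.await();
        System.out.println("checkpoint while stuck: " + sketch.tryCheckpoint(200));
        release.countDown();
        worker.join();
        System.out.println("checkpoint after release: " + sketch.tryCheckpoint(200));
    }
}
```

In this sketch the first attempt fails and the second succeeds, which mirrors the reported symptom that checkpoints recover only once the stuck thread goes away, for example after a restart.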


Best
Yun Tang

Sushant Sawant
Hi Yun,
I have captured a heap dump, which includes the thread stacks.
There is a lock held by a thread in the Elasticsearch sink operator.
Screenshot from JProfiler
How can I overcome this condition?
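A finding like this can also be cross-checked from inside the JVM, without a profiler. The following is a minimal sketch using the standard `java.lang.management` API; the class name `BlockedThreadReport` and the thread names in the demo are hypothetical and are not taken from the actual job. It lists every thread that is BLOCKED on a monitor together with the thread that currently owns that monitor, which is the same information one would look for in jstack output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BlockedThreadReport {
    /** Returns one line per thread that is BLOCKED on a monitor, naming the owning thread. */
    public static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    public static void main(String[] args) throws Exception {
        Object lock = new Object();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Demo thread that takes the monitor and keeps it, like a sink call that hangs.
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "sink-thread");

        // Demo thread that needs the same monitor and therefore blocks.
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // nothing to do once the monitor is obtained
            }
        }, "checkpoint-thread");

        holder.setDaemon(true);
        waiter.setDaemon(true);
        holder.start();
        held.await();
        waiter.start();
        while (waiter.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10); // wait until the second thread is really blocked
        }

        blockedThreads().forEach(System.out::println);
        release.countDown();
    }
}
```

If the owning thread reported this way is always the same sink thread, the next question is what that thread itself is waiting on, since the report only identifies the holder and does not explain why it never releases the lock.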


Thanks & Regards,
Sushant Sawant
