(DEPRECATED) Apache Flink User Mailing List archive.

Re: checkpoint failure suddenly even state size less than 1 mb

Classic

List

Threaded

3 messages Options

Sushant Sawant

Re: checkpoint failure suddenly even state size less than 1 mb

Hi team,

Anyone for help/suggestion, now we have stopped all input in kafka, there is no processing, no sink but checkpointing is failing.

Is it like once checkpoint fails it keeps failing forever until job restart.

Help appreciated.

Thanks & Regards,

Sushant Sawant

On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <[hidden email]> wrote:

Hi all,
m facing two issues which I believe are co-related though.
1. Kafka source shows high back pressure.
2. Sudden checkpoint failure for entire day until restart.

My job does following thing,
a. Read from Kafka
b. Asyncio to external system
c. Dumping in Cassandra, Elasticsearch

Checkpointing is using file system.
This flink job is proven under high load,
around 5000/sec throughput.
But recently we scaled down parallelism since, there wasn't any load in production and these issues started.

Please find the status shown by flink dashboard.
The github folder contains image where there was high back pressure and checkpoint failure
https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
and after restart, "everything is fine" images in this folder,
https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing

--
Could anyone point me towards direction what would have went wrong/ trouble shooting??

Thanks & Regards,
Sushant Sawant

Yun Tang

Re: checkpoint failure suddenly even state size less than 1 mb

Hi Sushant

What confuse me is that why source task cannot complete checkpoint in 3 minutes [1]. If no sub-task has ever completed the checkpoint, which means even source task cannot complete. Actually source task would not need to buffer the data. From what I see, it might be affected by acquiring the lock which hold by stream task main thread to process elements [2]. Could you use jstack to capture your java process' threads to know what happened when checkpoint failed?

[1] https://github.com/sushantbprise/flink-dashboard/blob/master/failed-checkpointing/state2.png

[2] https://github.com/apache/flink/blob/ccc7eb431477059b32fb924104c17af953620c74/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L758

Best

Yun Tang

From: Sushant Sawant <[hidden email]>
Sent: Tuesday, August 27, 2019 15:01
To: user <[hidden email]>
Subject: Re: checkpoint failure suddenly even state size less than 1 mb

Hi team,

Anyone for help/suggestion, now we have stopped all input in kafka, there is no processing, no sink but checkpointing is failing.

Is it like once checkpoint fails it keeps failing forever until job restart.

Help appreciated.

Thanks & Regards,

Sushant Sawant

On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <[hidden email]> wrote:

Hi all,
m facing two issues which I believe are co-related though.

1. Kafka source shows high back pressure.

2. Sudden checkpoint failure for entire day until restart.

My job does following thing,

a. Read from Kafka

b. Asyncio to external system

c. Dumping in Cassandra, Elasticsearch

Checkpointing is using file system.

This flink job is proven under high load,

around 5000/sec throughput.

But recently we scaled down parallelism since, there wasn't any load in production and these issues started.

Please find the status shown by flink dashboard.

The github folder contains image where there was high back pressure and checkpoint failure

https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing

and after restart, "everything is fine" images in this folder,

https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing

--

Could anyone point me towards direction what would have went wrong/ trouble shooting??

Thanks & Regards,

Sushant Sawant

Sushant Sawant

Re: checkpoint failure suddenly even state size less than 1 mb

Hi Yun,

Have captured the heap dump which includes thread stack.

There is an lock in thread in elasticsearch sink operator.

Screenshot of Jprofiler

https://github.com/sushantbprise/flink-dashboard/tree/master/thread-blocked

And I see lot many threads in waiting state.

I found this link which is kind a similar to my problem.

http://mail-archives.apache.org/mod_mbox/flink-user/201710.mbox/%3cFDD0A701-D6FE-433F-B343-19FD24CB328A@...%3e

How could I over come this condition?

Thanks & Regards,

Sushant Sawant

On Fri, 30 Aug 2019, 12:48 Yun Tang, <[hidden email]> wrote:

Hi Sushant

What confuse me is that why source task cannot complete checkpoint in 3 minutes [1]. If no sub-task has ever completed the checkpoint, which means even source task cannot complete. Actually source task would not need to buffer the data. From what I see, it might be affected by acquiring the lock which hold by stream task main thread to process elements [2]. Could you use jstack to capture your java process' threads to know what happened when checkpoint failed?

[1] https://github.com/sushantbprise/flink-dashboard/blob/master/failed-checkpointing/state2.png

[2] https://github.com/apache/flink/blob/ccc7eb431477059b32fb924104c17af953620c74/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L758

Best

Yun Tang

From: Sushant Sawant <[hidden email]>
Sent: Tuesday, August 27, 2019 15:01
To: user <[hidden email]>
Subject: Re: checkpoint failure suddenly even state size less than 1 mb

Hi team,
Anyone for help/suggestion, now we have stopped all input in kafka, there is no processing, no sink but checkpointing is failing.

Is it like once checkpoint fails it keeps failing forever until job restart.

Help appreciated.

Thanks & Regards,

Sushant Sawant

On 23 Aug 2019 12:56 p.m., "Sushant Sawant" <[hidden email]> wrote:

Hi all,
m facing two issues which I believe are co-related though.

1. Kafka source shows high back pressure.

2. Sudden checkpoint failure for entire day until restart.

My job does following thing,

a. Read from Kafka

b. Asyncio to external system

c. Dumping in Cassandra, Elasticsearch

Checkpointing is using file system.

This flink job is proven under high load,

around 5000/sec throughput.

But recently we scaled down parallelism since, there wasn't any load in production and these issues started.

Please find the status shown by flink dashboard.

The github folder contains image where there was high back pressure and checkpoint failure

https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing

and after restart, "everything is fine" images in this folder,

https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing

--

Could anyone point me towards direction what would have went wrong/ trouble shooting??

Thanks & Regards,

Sushant Sawant