Checkpointing not working

yuvraj singh
Hi,

I am doing checkpointing using S3 and RocksDB.
I checkpoint every 30 seconds, and the timeout is 10 seconds.

Most of the time it fails with: "Failure Time: 11:53:17 Cause: Checkpoint expired before completing."
I increased the timeout as well, but it is still not working for me.

Please advise.

Thanks 
Yubraj Singh 
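The failure message follows from a simple timing rule: a checkpoint is triggered once per interval, and any checkpoint still running when the timeout elapses is discarded as expired. The sketch below models only that rule, assuming each checkpoint's duration is known up front. The function name and the example durations are hypothetical, and the model ignores Flink details such as concurrent checkpoints and the minimum pause between checkpoints.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    checkpoint_id: int
    triggered_at_s: float  # seconds since job start
    outcome: str           # "completed" or "expired"


def simulate_checkpoints(durations_s, interval_s, timeout_s):
    """Toy model (hypothetical helper): checkpoint i is triggered at
    i * interval_s and expires if it needs longer than timeout_s."""
    results = []
    for i, duration in enumerate(durations_s):
        outcome = "completed" if duration <= timeout_s else "expired"
        results.append(CheckpointResult(i + 1, i * interval_s, outcome))
    return results


# Made-up durations in seconds, checked against a 30 s interval and 10 s timeout.
for r in simulate_checkpoints([8.0, 14.5, 9.9, 40.0], interval_s=30, timeout_s=10):
    print(r.checkpoint_id, r.triggered_at_s, r.outcome)
```

Raising `timeout_s` only helps if the real checkpoint durations fall below the new value; the durations themselves depend on state size and upload throughput.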

Re: Checkpointing not working

Jörn Franke
What do the logfiles say?

What does the source code look like?

Is it really needed to do checkpointing every 30 seconds?

(In reply to yuvraj singh, 19 Sep 2018, 08:25)

Re: Checkpointing not working

yuvraj singh
Log: Checkpoint 58 of job 0efaa0e6db5c38bec81dfefb159402c0 expired before completing.
I have a use case where I need to checkpoint frequently.

I am using Kafka to read the stream and building a 1-hour window, which always holds 50 GB of data and can hold more than that.

I have checked, and there is no backpressure.

Thanks 
Yubraj Singh 



(In reply to Jörn Franke, 19 Sep 2018, 12:07 PM)

Re: Re: Checkpointing not working

Vijay Bhaskar
Can you please check the following document and verify whether you have enough network bandwidth to upload a 30-second checkpoint interval's worth of streaming data?

Regards
Bhaskar
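The bandwidth question can be turned into a back-of-the-envelope check: the bytes a checkpoint must upload, divided by the checkpoint timeout, give the minimum sustained throughput to the checkpoint store. The sketch below compares two cases using numbers from this thread (about 50 GB of window state, a 30-second interval, a 10-second timeout): a full snapshot of all state, and only one interval's worth of new data. The second case assumes the 50 GB arrives evenly over the hour. The helper names are hypothetical, and the arithmetic ignores compression, parallel uploads by multiple subtasks, and S3 request overhead.

```python
GB = 10**9  # decimal gigabyte, in bytes


def required_throughput_bytes_per_s(upload_bytes, timeout_s):
    """Minimum average upload rate needed to finish within the timeout."""
    return upload_bytes / timeout_s


def to_gbit_per_s(bytes_per_s):
    """Convert bytes per second to gigabits per second."""
    return bytes_per_s * 8 / 10**9


state_bytes = 50 * GB  # ~50 GB window state, as reported in the thread
timeout_s = 10         # checkpoint timeout from the original post
interval_s = 30        # checkpoint interval from the original post
window_s = 3600        # 1-hour window

# Worst case: every checkpoint uploads the whole state (full snapshot).
rate_full = required_throughput_bytes_per_s(state_bytes, timeout_s)

# Assumption: the 50 GB arrives evenly over the hour, so one interval adds this much.
new_bytes = state_bytes * interval_s / window_s
rate_new = required_throughput_bytes_per_s(new_bytes, timeout_s)

print(f"full snapshot: {rate_full / GB:.1f} GB/s ({to_gbit_per_s(rate_full):.0f} Gbit/s)")
print(f"one interval of new data: {rate_new / 10**6:.1f} MB/s ({to_gbit_per_s(rate_new):.2f} Gbit/s)")
```

Under these assumptions, a full snapshot would need about 40 Gbit/s of sustained upload bandwidth, while uploading only the data added during one interval would need well under 1 Gbit/s.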

(In reply to yuvraj singh, 19 Sep 2018, 12:21 PM)

Re: Checkpointing not working

vino yang
In reply to this post by Jörn Franke
Hi Yubraj,

Can you set your log level to DEBUG and share the logs with us, or share a screenshot of the checkpoint information in your Flink web UI?

Thanks, vino.

(In reply to Jörn Franke, 19 Sep 2018, 2:37 PM)

Re: Checkpointing not working

Stefan Richter
Hi,

In the absence of any logs, my guess would be that your checkpoints are simply not able to complete within 10 seconds: the state might be too large, or the network and file system too slow. Are you using full or incremental checkpoints? For your relatively small interval, I suggest that you try incremental checkpoints. Still, I think your timeout and interval are pretty ambitious.

Best,
Stefan
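To illustrate the difference Stefan describes: with RocksDB, a full checkpoint uploads the entire state every time, while an incremental checkpoint uploads only the files created since the previous checkpoint and reuses the ones already stored. In Flink 1.x, incremental mode is enabled through the `enableIncrementalCheckpointing` argument of the `RocksDBStateBackend` constructor or the `state.backend.incremental` option. The toy model below is not Flink's implementation. It treats the state as a set of immutable files with sizes, a simplification of RocksDB SST files, and the file names and sizes are made up.

```python
def upload_plan(snapshots, incremental):
    """Return the bytes uploaded per checkpoint (toy model, hypothetical helper).

    snapshots: list of dicts mapping immutable file name -> size in bytes,
    one dict per checkpoint (a simplified view of RocksDB SST files).
    """
    uploaded = set()
    per_checkpoint = []
    for files in snapshots:
        if incremental:
            # Only files not already in the checkpoint store are uploaded.
            new = {name: size for name, size in files.items() if name not in uploaded}
        else:
            # A full checkpoint re-uploads every live file.
            new = files
        per_checkpoint.append(sum(new.values()))
        uploaded.update(new)  # adds the file names (dict keys)
    return per_checkpoint


MB = 10**6
snapshots = [
    {"000001.sst": 400 * MB, "000002.sst": 400 * MB},
    {"000001.sst": 400 * MB, "000002.sst": 400 * MB, "000003.sst": 50 * MB},
    {"000001.sst": 400 * MB, "000002.sst": 400 * MB, "000003.sst": 50 * MB,
     "000004.sst": 50 * MB},
]
print("full:       ", upload_plan(snapshots, incremental=False))
print("incremental:", upload_plan(snapshots, incremental=True))
```

The first checkpoint is always full size. In practice, RocksDB compaction rewrites files into new ones, so each incremental checkpoint uploads more than just the newly arrived data.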

(In reply to vino yang, 20 Sep 2018, 10:17)