Trigger Time vs. Latest Acknowledgement

Juho Autio
I'm triggering nightly savepoints at 23:59:00 with crontab on the Flink cluster.

For example last night's savepoint has this information:

Trigger Time: 23:59:14
Latest Acknowledgement: 00:00:59

What are the min/max boundaries for the data contained by the savepoint? Can I deduce from this either of the following:

a) the savepoint cannot contain any data that was produced after 23:59:14
b) the savepoint cannot contain any data that was produced after 00:00:59

My use case is this: if I restore the nightly savepoint, I want to be sure that any data produced during the current day will be included (plus some data from the previous day, which is ok). If the answer to the above question is that (a) is false but (b) holds, then I would need to trigger the savepoint early enough for it to complete before midnight.

Something from the docs that doesn't seem to answer my question:

> Trigger Time: The time when the checkpoint was triggered at the JobManager.
> Latest Acknowledgement: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).
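
The scheduling arithmetic behind that use case can be sketched roughly as follows. This is only an illustration: it assumes last night's timings are representative and that (b) actually holds. The function name `latest_cron_time`, the calendar dates, and the 60-second safety margin are all hypothetical.

```python
from datetime import datetime, timedelta

def latest_cron_time(cron_fired, trigger_time, latest_ack, deadline,
                     margin=timedelta(seconds=60)):
    """Latest cron start for which the last acknowledgement lands before the deadline."""
    startup_delay = trigger_time - cron_fired  # cron start -> Trigger Time
    duration = latest_ack - trigger_time       # Trigger Time -> Latest Acknowledgement
    return deadline - duration - startup_delay - margin

# Times of day are from the post; the calendar dates are placeholders.
cron_fired = datetime(2018, 1, 28, 23, 59, 0)
trigger_time = datetime(2018, 1, 28, 23, 59, 14)
latest_ack = datetime(2018, 1, 29, 0, 0, 59)
midnight = datetime(2018, 1, 29, 0, 0, 0)

print(latest_cron_time(cron_fired, trigger_time, latest_ack, midnight).time())
# prints 23:57:01
```

With a 14-second startup delay and a 105-second savepoint, the cron entry would have to fire at about 23:57 rather than 23:59, and earlier still if savepoint duration varies from night to night.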

Re: Trigger Time vs. Latest Acknowledgement

Aljoscha Krettek
Hi,

I think a) doesn't hold because there is no synchronisation between the CheckpointCoordinator and the sources doing the reading. I think b) will hold but it's also not exact because of clock differences between different machines and whatnot.

Best,
Aljoscha

On 29. Jan 2018, at 15:34, Juho Autio <[hidden email]> wrote:

Re: Trigger Time vs. Latest Acknowledgement

Kostas Kloudas
Hi Juho,

I think that neither a) nor b) holds.

The reported times are wall-clock times (or processing time in Flink terminology) when the checkpoint 
started and when it finished. 

What you want, if I understand correctly, is for these times to reflect the event time of your pipeline. In other
words, you want to say “Trigger my checkpoint so that it contains all data generated before 23:59”.

Given that the skew between event and processing time is unpredictable, trying to provide guarantees 
about what is included in a checkpoint or savepoint is tricky, to say the least.
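
To make the skew argument concrete, here is a toy sketch in plain Python (it does not use any Flink API). The records and their timestamps are invented, and the snapshot is treated as a single wall-clock instant, which is a simplification. Each record is a pair of event time and the processing time at which a source read it.

```python
from datetime import datetime

# (event_time, arrival_time) pairs; all values are invented for illustration.
records = [
    (datetime(2018, 1, 28, 23, 58, 30), datetime(2018, 1, 28, 23, 58, 31)),
    (datetime(2018, 1, 28, 23, 58, 50), datetime(2018, 1, 29, 0, 2, 0)),  # delayed upstream
    (datetime(2018, 1, 28, 23, 59, 30), datetime(2018, 1, 28, 23, 59, 31)),
]

def snapshot_contents(records, snapshot_time):
    """Event times of records that had arrived by the wall-clock snapshot time."""
    return [event for event, arrived in records if arrived <= snapshot_time]

snapshot_time = datetime(2018, 1, 29, 0, 0, 59)  # the Latest Acknowledgement time
included = snapshot_contents(records, snapshot_time)

# Records generated before 23:59 that the snapshot nevertheless lacks.
cutoff = datetime(2018, 1, 28, 23, 59, 0)
missing = [event for event, _ in records if event < cutoff and event not in included]

print(len(included), len(missing))
# prints 2 1
```

The delayed record was generated at 23:58:50 but is absent from the snapshot, so a wall-clock timestamp alone does not say which event times are covered.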

Triggering savepoints based on event time is not supported.

If the purpose of these nightly savepoints is fault tolerance, then I would say that you do not need
such strong guarantees on what is included in them. Flink will restart from where it left off at the moment
of the savepoint.

If you want it for other purposes, then you may be able to structure your job differently to fit your needs.
But for this it would help if you shared a bit more information.

Thanks,
Kostas


On Jan 30, 2018, at 12:04 PM, Aljoscha Krettek <[hidden email]> wrote:
