(DEPRECATED) Apache Flink User Mailing List archive.

Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

Classic

List

Threaded

4 messages Options

min.tan

Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

Hi,

I have a simple data pipeline of a Kafka source, a flink map operator and a Kafka sink.

I have a quick question about latency caused by the checkpoint on the exactly once mode.

Due to the changes are committed and visible on a checkpoint completion, so the latency could be as long as that length of checkpoint interval e.g. 5seconds?

Is my understanding correct?

If I use the at least mode, there will be this addition on latency. More interestingly, the flink document https://ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html indicate that "dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode."

Unfortunately, I have been not able to achieve the exactly once with the at least once. Do I need more settings than I have with the exactly once mode?

Many thanks for the advises in advance.

Min

smime.p7s (10K) Download Attachment

Guowei Ma

Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

If your implementation only commits your changing after the complete of a checkpoint I think the latency of e2e is at least the interval of checkpoint.

I think the document wants to say that a topology, which only has flatmap/filter/map(no task has more than one input) could achieve the exactly once semantics even in at least mode since the effect of barrier alignments in at least mode is same as in exactly once mode by coincidence for such topology.

I think there might be some benefits if you could set the parallelism of source/sink/flatmap to the same parallelism(there could exist other way) in some situation since during the alignments the task, which has many inputs would not deal with the elements behind the barrier in exactly mode until the barriers of all inputs arrive. (If your checkpoint interval is very very long I think there would be no difference).

Best

Guowei

<[hidden email]>于2019年4月7日周日上午3:14写道：

Hi,

I have a simple data pipeline of a Kafka source, a flink map operator and a Kafka sink.

I have a quick question about latency caused by the checkpoint on the exactly once mode.

Due to the changes are committed and visible on a checkpoint completion, so the latency could be as long as that length of checkpoint interval e.g. 5seconds?

Is my understanding correct?

If I use the at least mode, there will be this addition on latency. More interestingly, the flink document https://ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html indicate that "dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode."

Unfortunately, I have been not able to achieve the exactly once with the at least once. Do I need more settings than I have with the exactly once mode?

Many thanks for the advises in advance.

Min

Best,

Guowei

Fabian Hueske-2

Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

Hi Min,

Guowei is right, the comment in the documentation about exactly-once in embarrassingly parallel data flows refers to exactly-once *state consistency*, not *end-to-end* exactly-once.

However, in strictly forwarding pipelines, enabling exactly-once checkpoints should not have drawbacks compared to at-least-once since there won't be any barrier alignment anyway. I'd simple enable exactly-once.

If you enable end-to-end exactly-once for a Kafka sink, the latency will be at least the checkpointing interval (maybe even more, depending on how you configure Flink's checkpointing mechanism and event-time / watermarks).

Best,

Fabian

Am So., 7. Apr. 2019 um 08:41 Uhr schrieb Guowei Ma <[hidden email]>:

If your implementation only commits your changing after the complete of a checkpoint I think the latency of e2e is at least the interval of checkpoint.

I think the document wants to say that a topology, which only has flatmap/filter/map(no task has more than one input) could achieve the exactly once semantics even in at least mode since the effect of barrier alignments in at least mode is same as in exactly once mode by coincidence for such topology.

I think there might be some benefits if you could set the parallelism of source/sink/flatmap to the same parallelism(there could exist other way) in some situation since during the alignments the task, which has many inputs would not deal with the elements behind the barrier in exactly mode until the barriers of all inputs arrive. (If your checkpoint interval is very very long I think there would be no difference).

Best
Guowei

<[hidden email]>于2019年4月7日周日上午3:14写道：
Hi,

I have a simple data pipeline of a Kafka source, a flink map operator and a Kafka sink.

I have a quick question about latency caused by the checkpoint on the exactly once mode.

Due to the changes are committed and visible on a checkpoint completion, so the latency could be as long as that length of checkpoint interval e.g. 5seconds?

Is my understanding correct?

If I use the at least mode, there will be this addition on latency. More interestingly, the flink document https://ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html indicate that "dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode."

Unfortunately, I have been not able to achieve the exactly once with the at least once. Do I need more settings than I have with the exactly once mode?

Many thanks for the advises in advance.

Min

--
Best,
Guowei

min.tan

RE: Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

In reply to this post by Guowei Ma

Many thanks for your quick reply.

1) My implementation has no commits. All commits are done in FlinkKafkaProducer class I envisage.

KeyedSerializationSchemaWrapper keyedSerializationSchemaWrapper = new KeyedSerializationSchemaWrapper(new SimpleStringSchema());

new FlinkKafkaProducer<String>("test.out", keyedSerializationSchemaWrapper,KafkaProperities.getProperties(env), FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

If the latency could be as long as the interval of checkpoint, it would be not ideal for a long interval setting e.g. a few minutes

2) My parallelism is set on the job level, I would expect they all have the same parallelism for each source, operator and sink. Actually, my test code only has one kafka source, one map and one kafka sink. It has produced duplication in a restart if I use the at least once mode.

Regards,

Min

From: Guowei Ma [mailto:[hidden email]]
Sent: Sonntag, 7. April 2019 08:42
To: Tan, Min
Cc: [hidden email]
Subject: [External] Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

If your implementation only commits your changing after the complete of a checkpoint I think the latency of e2e is at least the interval of checkpoint.

Best

Guowei

<[hidden email]>于2019年4月7日周日上午3:14写道：

Hi,

I have a simple data pipeline of a Kafka source, a flink map operator and a Kafka sink.

I have a quick question about latency caused by the checkpoint on the exactly once mode.

Due to the changes are committed and visible on a checkpoint completion, so the latency could be as long as that length of checkpoint interval e.g. 5seconds?

Is my understanding correct?

If I use the at least mode, there will be this addition on latency. More interestingly, the flink document https://ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html indicate that "dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode."

Unfortunately, I have been not able to achieve the exactly once with the at least once. Do I need more settings than I have with the exactly once mode?

Many thanks for the advises in advance.

Min

Best,

Guowei

E-mails can involve SUBSTANTIAL RISKS, e.g. lack of confidentiality, potential manipulation of contents and/or sender's address, incorrect recipient (misdirection), viruses etc. Based on previous e-mail correspondence with you and/or an agreement reached with you, UBS considers itself authorized to contact you via e-mail. UBS assumes no responsibility for any loss or damage resulting from the use of e-mails.
The recipient is aware of and accepts the inherent risks of using e-mails, in particular the risk that the banking relationship and confidential information relating thereto are disclosed to third parties.
UBS reserves the right to retain and monitor all messages. Messages are protected and accessed only in legally justified cases.
For information on how UBS uses and discloses personal data, how long we retain it, how we keep it secure and your data protection rights, please see our Privacy Notice http://www.ubs.com/privacy-statement