Many thanks for your quick reply.
1)
My implementation has no commits of its own. I assume all commits are done inside the FlinkKafkaProducer class:
KeyedSerializationSchemaWrapper<String> keyedSerializationSchemaWrapper = new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema());
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>("test.out", keyedSerializationSchemaWrapper, KafkaProperities.getProperties(env), FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
If the latency can be as long as the checkpoint interval, that is not ideal for a long interval setting, e.g. a few minutes.
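For reference, a minimal sketch of the Kafka client settings that typically interact with EXACTLY_ONCE and the checkpoint interval. The keys `transaction.timeout.ms` (producer) and `isolation.level` (consumer) are standard Kafka configuration keys; the class name, the helper methods and the factor of 3 are hypothetical and only illustrate that the transaction timeout should comfortably exceed the checkpoint interval. A downstream consumer sees only committed records when it reads with `read_committed`.

```java
import java.util.Properties;

class ExactlyOnceKafkaSettings {

    // Producer-side properties for a transactional (EXACTLY_ONCE) sink.
    static Properties producerProps(String bootstrapServers, long checkpointIntervalMs) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        // Assumption: a transaction stays open for about one checkpoint interval,
        // so the timeout is set to a multiple of it. The factor 3 is illustrative,
        // and the broker's transaction.max.timeout.ms must allow this value.
        p.setProperty("transaction.timeout.ms", Long.toString(checkpointIntervalMs * 3));
        return p;
    }

    // Properties for a consumer that reads the sink topic downstream.
    static Properties downstreamConsumerProps(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        // Only return records from committed transactions
        // (the Kafka consumer default is read_uncommitted).
        p.setProperty("isolation.level", "read_committed");
        return p;
    }

    public static void main(String[] args) {
        // 5000 ms mirrors the 5-second interval mentioned in this thread.
        System.out.println(producerProps("localhost:9092", 5000L));
        System.out.println(downstreamConsumerProps("localhost:9092"));
    }
}
```

With a checkpoint interval of a few minutes, the same trade-off applies: the longer the interval, the longer records stay invisible to `read_committed` consumers.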
2)
My parallelism is set at the job level, so I would expect every source, operator and sink to have the same parallelism. In fact,
my test code has only one Kafka source, one map and one Kafka sink. It produced duplicates after a restart when I used the at-least-once mode.
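A minimal sketch of how a restart can lead to duplicates in the output topic when the sink writes records immediately instead of inside a transaction. It is a plain Java simulation, not Flink code; the class name, the method and the offsets are hypothetical. It assumes the source rewinds to the last checkpointed offset after a failure and that everything written before the failure stays in the sink.

```java
import java.util.ArrayList;
import java.util.List;

class AtLeastOnceReplaySketch {

    // Simulates source -> map -> sink with a non-transactional sink.
    // checkpointedOffset: source offset stored in the last completed checkpoint.
    // failAfter: number of records that reached the sink before the failure.
    static List<String> run(List<String> input, int checkpointedOffset, int failAfter) {
        List<String> sinkOutput = new ArrayList<>();
        // First attempt: records [0, failAfter) are mapped and written right away.
        for (int i = 0; i < failAfter; i++) {
            sinkOutput.add(input.get(i).toUpperCase());
        }
        // Restart: the source rewinds to the checkpointed offset and replays,
        // so records in [checkpointedOffset, failAfter) are written a second time.
        for (int i = checkpointedOffset; i < input.size(); i++) {
            sinkOutput.add(input.get(i).toUpperCase());
        }
        return sinkOutput;
    }

    public static void main(String[] args) {
        List<String> out = run(List.of("a", "b", "c", "d", "e"), 2, 4);
        System.out.println(out); // [A, B, C, D, C, D, E]
    }
}
```

In this model the operator state is still restored consistently; the duplicates come from records that were already written to the sink between the last checkpoint and the failure.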
Regards,
Min
From: Guowei Ma [mailto:[hidden email]]
Sent: Sunday, 7 April 2019 08:42
To: Tan, Min
Cc: [hidden email]
Subject: [External] Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode
If your implementation only commits its changes after a checkpoint completes, I think the end-to-end latency is at least the checkpoint interval.
I think the document means that a topology that has only flatMap/filter/map operators (no task has more than one input) can achieve exactly-once semantics even in at-least-once mode, because for such a topology the effect of barrier alignment in at-least-once mode
happens to be the same as in exactly-once mode.
I think there might be some benefit, in some situations, in setting the source/sink/flatMap to the same parallelism (there may be other ways), because during alignment in exactly-once mode a task with several inputs does not process
the elements behind the barrier until the barriers of all its inputs have arrived. (If your checkpoint interval is very long, I think there would be no difference.)
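The alignment behaviour described above can be sketched as a small plain-Java simulation (not Flink code; the class name, method and event encoding are hypothetical). Each event is a two-character string: the input channel number followed by `e` for an element or `B` for a checkpoint barrier. The task simply counts elements, and the method returns the count that would be stored in the snapshot.

```java
class BarrierAlignmentSketch {

    // Returns the element count captured in the snapshot.
    // align = true models exactly-once mode: elements arriving behind a barrier
    // on one input are held back until all inputs have delivered their barrier.
    // align = false models at-least-once mode: such elements are processed at once.
    static int snapshotCount(String[] events, int numInputs, boolean align) {
        boolean[] barrierSeen = new boolean[numInputs];
        int barriers = 0;
        int count = 0;
        for (String ev : events) {
            int channel = ev.charAt(0) - '0';
            if (ev.charAt(1) == 'B') {
                barrierSeen[channel] = true;
                if (++barriers == numInputs) {
                    return count; // all barriers arrived: take the snapshot
                }
            } else if (align && barrierSeen[channel]) {
                // Held back during alignment, so not part of this snapshot.
            } else {
                count++;
            }
        }
        throw new IllegalStateException("checkpoint never completed");
    }

    public static void main(String[] args) {
        String[] twoInputs = {"0e", "1e", "0B", "0e", "0e", "1e", "1B"};
        System.out.println(snapshotCount(twoInputs, 2, true));  // 3
        System.out.println(snapshotCount(twoInputs, 2, false)); // 5

        String[] oneInput = {"0e", "0e", "0B", "0e"};
        System.out.println(snapshotCount(oneInput, 1, true));   // 2
        System.out.println(snapshotCount(oneInput, 1, false));  // 2
    }
}
```

With two inputs, the unaligned snapshot already contains two elements from behind the barrier, which would be counted again after a restore. With a single input there is nothing to align, so both modes produce the same snapshot.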
Best
Guowei
<[hidden email]> wrote on Sunday, 7 April 2019 at 3:14 AM:
Hi,
I have a simple data pipeline of a Kafka source, a flink map operator and a Kafka sink.
I have a quick question about latency caused by the checkpoint on the exactly once mode.
Because the changes are committed and become visible only when a checkpoint completes, the latency could be as long as the checkpoint interval, e.g. 5 seconds?
Is my understanding correct?
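A minimal sketch of the arithmetic behind this question, under simplifying assumptions: checkpoints complete exactly at multiples of the interval, taking the checkpoint and committing the transaction take no time, and a record written exactly at a checkpoint boundary belongs to the next transaction. The class and method names are hypothetical.

```java
class CommitLatencySketch {

    // Time at which a record written at writeTimeMs becomes visible to
    // read_committed consumers, i.e. the next checkpoint completion.
    static long visibleAtMs(long writeTimeMs, long intervalMs) {
        long completedCheckpoints = writeTimeMs / intervalMs;
        return (completedCheckpoints + 1) * intervalMs;
    }

    // Extra latency added by waiting for the commit.
    static long addedLatencyMs(long writeTimeMs, long intervalMs) {
        return visibleAtMs(writeTimeMs, intervalMs) - writeTimeMs;
    }

    public static void main(String[] args) {
        long interval = 5000L; // the 5-second example from the question
        System.out.println(addedLatencyMs(0L, interval));    // 5000
        System.out.println(addedLatencyMs(2500L, interval)); // 2500
        System.out.println(addedLatencyMs(4999L, interval)); // 1
    }
}
```

Under these assumptions the added latency ranges from almost zero up to one full interval, depending on where in the interval the record is written.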
If I use the at-least-once mode, there should not be this additional latency. More interestingly, the Flink documentation https://ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html indicates that "dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode."
Unfortunately, I have not been able to achieve exactly-once with the at-least-once mode. Do I need more settings than I have with the exactly-once mode?
Many thanks in advance for your advice.
Min
--
Best,
Guowei