Hi,
For Flink 1.8 (and 1.9) the only thing that you can do, is to try to limit amount of data buffered between the nodes (check Flink network configuration [1] for number of buffers and or buffer pool sizes). This can reduce maximal throughput (but only if the network transfer is a significant cost, for example if your records are extremely quick to process), but it will speed up checkpointing during back pressure.
There are some plans to address this and maybe there will be some improvement in Flink 1.10.
If your job is completely stalled because of an outage, then I don’t think that you can do much now, since even with only one single buffered record the checkpoints will not progress. We might try to address this, but that’s further down the road.
Piotrek
Hi,
I'm still facing the same issue under 1.8. Our pipeline uses end-to-end
exactly-once semantic, which means the consumer program cannot read the
messages until they are committed. So in case of an outage, the whole
runtime delay is passed over to the next stream processor application and
creates an even larger delay in our processing pipeline. Is there any way to
force the checkpoint to complete even under backpressure situation?
Thank you in advance.
Regards,
--
Mohammad Hosseinian
Software Developer
Information Design One AG
Phone
+49-69-244502-0
Fax +49-69-244502-10
Web www.id1.de
Information Design One AG, Baseler Strasse 10, 60329 Frankfurt am Main, Germany
Registration: Amtsgericht Frankfurt am Main, HRB 52596
Executive Board: Robert Peters, Benjamin Walther, Supervisory Board: Christian Hecht