version: 1.8.3
graph: source -> map -> sink

Scenario: a source subtask failure causes the graph to restart, but the exception displayed in the Flink UI is not the cause of the task failure.

JM log:

    2020-06-22 14:29:01.087 INFO org.apache.flink.runtime.
    java.lang.Exception: Could not perform checkpoint 87 for operator Sink: adapterOutput (19/30).
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.runtime.
        at java.lang.Thread.run(Thread.
    Caused by: java.lang.Exception: Could not complete snapshot 87 for operator Sink: adapterOutput (19/30).
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        ... 8 common frames omitted
    Caused by: java.lang.Exception: Failed to send data to Kafka: The server disconnected before a response was received.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        at org.apache.flink.streaming.
        ... 13 common frames omitted

TM log: Running to Canceling

    2020-06-22 15:39:19.816 INFO com.xxx.client.consumer.
    2020-06-22 15:39:19.816 INFO org.apache.flink.runtime.

Is this a known issue?
|
Hi Andrew,

this looks like your Flink cluster has a flaky connection to the Kafka cluster, or your Kafka cluster was down.

Since the operator failed in the synchronous part of the snapshot, it resorted to failing the task to avoid inconsistent operator state. If you configured restarts, the job simply restarts from your last checkpoint, 86, and recomputes the data. What would be your expectation? That the checkpoint fails but the job continues without a restart?

In general, the issue with Kafka is that the transactions used for exactly-once eventually time out. So if too many checkpoints cannot be taken, you'd ultimately have incorrect data. Hence, failing and restarting is cleaner, as it guarantees consistent data.

On Mon, Jun 22, 2020 at 10:16 AM Andrew <[hidden email]> wrote:
    version: 1.8.3

--
Arvid Heise | Senior Java Developer
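
To make the transaction timeout point a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It is not Flink or Kafka API. The function name and all numbers are hypothetical, and it assumes a simplified model in which a transaction opened at the start of one checkpoint interval can only be committed once a later checkpoint completes, ignoring the time the checkpoint itself takes. Under that assumption, k consecutive failed checkpoints keep the oldest transaction open for roughly (k + 1) intervals.

```python
def tolerable_failed_checkpoints(checkpoint_interval_ms: int,
                                 transaction_timeout_ms: int) -> int:
    """Estimate how many checkpoints in a row may fail before the oldest
    open transaction would outlive its timeout (simplified model, see above).

    Returns 0 if not even a single failed checkpoint fits into the timeout.
    """
    if checkpoint_interval_ms <= 0 or transaction_timeout_ms <= 0:
        raise ValueError("interval and timeout must be positive")

    # Largest number of whole intervals n with n * interval < timeout.
    intervals_within_timeout = (transaction_timeout_ms - 1) // checkpoint_interval_ms

    # One interval is consumed even when every checkpoint succeeds,
    # so only the remaining intervals are available for failures.
    return max(intervals_within_timeout - 1, 0)


if __name__ == "__main__":
    # Hypothetical settings: 1 minute checkpoint interval, 15 minute timeout.
    print(tolerable_failed_checkpoints(60_000, 900_000))
```

With these made-up settings the estimate is 13 consecutive failures. If the timeout is close to the checkpoint interval, the margin shrinks to zero, which is one way to see why failing and restarting from the last checkpoint is the safer default.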