Hi all,

We are trying to set up failover regions so that Flink stops and restarts only the tasks in the affected region instead of failing the entire stream job. We use one main stream that reads from a Kafka topic and a number of side outputs, each of which processes the events from that topic differently. For the processing in the side outputs we use the process function provided by Flink. So far, whenever one side output stream failed, the whole job failed.

Is there anything that needs to be done or set on the side outputs so that Flink recognizes them as regions? Is it even possible to have Flink handle side outputs as regions and restart only the one side output stream that failed?

Many thanks in advance!

Cheers,
Patrick

--
Patrick Eifler
Senior Software Engineer (BI)
|
Hi Patrick,

Flink supports regional failover [1], which restarts only the tasks that are connected to the failed task via pipelined data exchanges. Hence, with an embarrassingly parallel topology or a batch job, Flink should not restart the whole job when a task fails. In the case of side outputs, however, I think they are connected to the main stream via pipelined data exchanges and are therefore part of the same failover region as the main stream.

Cheers,
Till

On Tue, Nov 24, 2020 at 5:15 PM Eifler, Patrick <[hidden email]> wrote:
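As an illustration of the failover regions Till describes above, here is a minimal sketch in plain Java. It is not Flink code and does not use any Flink API: the class `FailoverRegions`, its methods, and the task names are all hypothetical, and the model is simplified to "a region is the set of tasks reachable from each other via pipelined exchanges".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (hypothetical, not a Flink class): groups tasks into failover regions. */
public class FailoverRegions {

    // Union-find parent pointers, keyed by task name. A task that maps to itself is a root.
    private final Map<String, String> parent = new LinkedHashMap<>();

    /** Registers a data exchange; only pipelined exchanges merge two tasks into one region. */
    public void addExchange(String from, String to, boolean pipelined) {
        parent.putIfAbsent(from, from);
        parent.putIfAbsent(to, to);
        if (pipelined) {
            parent.put(find(from), find(to));
        }
    }

    // Follows parent pointers up to the representative (root) of the task's region.
    private String find(String task) {
        if (!parent.containsKey(task)) {
            throw new IllegalArgumentException("Unknown task: " + task);
        }
        String root = task;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        return root;
    }

    /** Returns every task that shares a region with the failed task, in sorted order. */
    public Set<String> tasksToRestart(String failedTask) {
        String root = find(failedTask);
        Set<String> region = new TreeSet<>();
        for (String task : parent.keySet()) {
            if (find(task).equals(root)) {
                region.add(task);
            }
        }
        return region;
    }

    public static void main(String[] args) {
        // Job shaped like Patrick's: one source, one process function, several side outputs.
        FailoverRegions sideOutputJob = new FailoverRegions();
        sideOutputJob.addExchange("kafka-source", "process", true);
        sideOutputJob.addExchange("process", "side-output-A", true);
        sideOutputJob.addExchange("process", "side-output-B", true);
        System.out.println(sideOutputJob.tasksToRestart("side-output-A"));

        // Embarrassingly parallel job: each subtask chain is its own region.
        FailoverRegions parallelJob = new FailoverRegions();
        parallelJob.addExchange("source-0", "map-0", true);
        parallelJob.addExchange("source-1", "map-1", true);
        System.out.println(parallelJob.tasksToRestart("map-0"));

        // Batch-style job with a blocking (non-pipelined) exchange between stages.
        // Simplification: a real scheduler may also have to restart the producing
        // region if the intermediate result is no longer available.
        FailoverRegions batchJob = new FailoverRegions();
        batchJob.addExchange("stage-1", "stage-2", false);
        System.out.println(batchJob.tasksToRestart("stage-2"));
    }
}
```

In this model a failure in side-output-A pulls in the source, the process function, and every other side output, because they all hang off the same pipelined exchange. The embarrassingly parallel and blocking cases, by contrast, restart only a part of the job.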
|
Hi Till,

Thanks for your reply. Is there any option to disconnect the side outputs from the pipelined data exchanges of the main stream? Side outputs are very beneficial in terms of performance and usability, and they fit our use case very nicely. However, this pipelined connection to the main stream is a real concern, because all streams of the job are cancelled if one fails. With many side outputs, that is not something we can maintain at scale. So we are looking for ways to keep the side outputs but still get individual region-based failover to work.

The other thing I have noticed is that, when using side outputs, those streams are just named Unregistered_DataStream_ in the Flink Web UI. Setting .name and .uid on the side output stream does not change this. Is there any way to name the side output streams so that the custom name appears in the Flink Web UI?

Any help or tips are highly appreciated. Many thanks in advance.

Cheers,
Patrick

--
Patrick Eifler
Senior Software Engineer (BI)
(In reply to Till Rohrmann's message above.)
|
Hi Patrick,

At the moment it is not possible to disconnect side outputs from other streaming operators. I guess what you would like to have is an operator that consumes on a best-effort basis and is allowed to lose some data while it is being restarted. This is currently not supported by Flink.

Concerning the web UI problem, I would suggest creating a JIRA ticket [1], because it sounds like a missing feature.

Cheers,
Till

On Wed, Nov 25, 2020 at 11:15 AM Eifler, Patrick <[hidden email]> wrote: