Finite source without blocking save-points

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Finite source without blocking save-points

Gaël Renoux
Hello everyone,

I have a job which runs continuously, but it also needs to send a single specific Kafka message on startup. I tried the obvious approach to use StreamExecutionEnvironment.fromElements and add a Kafka sink, however that's not possible: the source being finished, it becomes impossible to stop the job with a save-point later.

The best solution I found is creating a basic Kafka producer to send the message, and running that producer inside the job's startup script (before calling StreamExecutionEnvironment.execute()). However, there's a race condition, where the message could be sent and trigger stuff before the job is ready to receive messages. In addition, it forces me to have a separate Kafka producer, while Flink already comes with Kafka sinks. And finally, it's pretty specific to my use case (sending a Kafka message), and it looks like there should be a generic solution here.

Do you guys know of any better way to do this? Is there any way to set up a finite source that will not block save-points?

Just in case, the global use case is nothing special: my job maintains a set of rules as broadcast state in operators and handle input according to those rules. On startup, I need to request all rules to be sent at once (the emitter normally sends updated rules only), in case the rule state has been lost (happens when we evolve the rule model, for instance), and this is done through a Kafka message.

Thanks in advance!

Gaël Renoux
Reply | Threaded
Open this post in threaded view
|

Re: Finite source without blocking save-points

bupt_ljy

Hi Gael,

I had a similar situation before. Actually you don’t need to accomplish this in such a complicated way. I guess you’ve already had a rules source and you can send rules in #open function for a startup if your rules source inherit from #RichParallelSourceFunction.


Best,

Jiayi Liao


 Original Message 
Sender: Gaël Renoux<[hidden email]>
Recipient: user<[hidden email]>
Date: Monday, Nov 4, 2019 22:50
Subject: Finite source without blocking save-points

Hello everyone,

I have a job which runs continuously, but it also needs to send a single specific Kafka message on startup. I tried the obvious approach to use StreamExecutionEnvironment.fromElements and add a Kafka sink, however that's not possible: the source being finished, it becomes impossible to stop the job with a save-point later.

The best solution I found is creating a basic Kafka producer to send the message, and running that producer inside the job's startup script (before calling StreamExecutionEnvironment.execute()). However, there's a race condition, where the message could be sent and trigger stuff before the job is ready to receive messages. In addition, it forces me to have a separate Kafka producer, while Flink already comes with Kafka sinks. And finally, it's pretty specific to my use case (sending a Kafka message), and it looks like there should be a generic solution here.

Do you guys know of any better way to do this? Is there any way to set up a finite source that will not block save-points?

Just in case, the global use case is nothing special: my job maintains a set of rules as broadcast state in operators and handle input according to those rules. On startup, I need to request all rules to be sent at once (the emitter normally sends updated rules only), in case the rule state has been lost (happens when we evolve the rule model, for instance), and this is done through a Kafka message.

Thanks in advance!

Gaël Renoux
Reply | Threaded
Open this post in threaded view
|

Re: Finite source without blocking save-points

Gaël Renoux
Hi Jiayi,

This would allow me to call the Kafka producer without risking a race condition, but it comes with its own problem: unless the source has a parallelism of 1, it will trigger multiple times. I can create a specific source that doesn't produce anything, has a parallelism of 1, and calls the producer from its open method : it's a bit ugly, but it would get rid of the race condition.

On Mon, Nov 4, 2019 at 3:59 PM bupt_ljy <[hidden email]> wrote:

Hi Gael,

I had a similar situation before. Actually you don’t need to accomplish this in such a complicated way. I guess you’ve already had a rules source and you can send rules in #open function for a startup if your rules source inherit from #RichParallelSourceFunction.


Best,

Jiayi Liao


 Original Message 
Sender: Gaël Renoux<[hidden email]>
Recipient: user<[hidden email]>
Date: Monday, Nov 4, 2019 22:50
Subject: Finite source without blocking save-points

Hello everyone,

I have a job which runs continuously, but it also needs to send a single specific Kafka message on startup. I tried the obvious approach to use StreamExecutionEnvironment.fromElements and add a Kafka sink, however that's not possible: the source being finished, it becomes impossible to stop the job with a save-point later.

The best solution I found is creating a basic Kafka producer to send the message, and running that producer inside the job's startup script (before calling StreamExecutionEnvironment.execute()). However, there's a race condition, where the message could be sent and trigger stuff before the job is ready to receive messages. In addition, it forces me to have a separate Kafka producer, while Flink already comes with Kafka sinks. And finally, it's pretty specific to my use case (sending a Kafka message), and it looks like there should be a generic solution here.

Do you guys know of any better way to do this? Is there any way to set up a finite source that will not block save-points?

Just in case, the global use case is nothing special: my job maintains a set of rules as broadcast state in operators and handle input according to those rules. On startup, I need to request all rules to be sent at once (the emitter normally sends updated rules only), in case the rule state has been lost (happens when we evolve the rule model, for instance), and this is done through a Kafka message.

Thanks in advance!

Gaël Renoux


--
Gaël Renoux
Senior R&D Engineer, DataDome
M <a href="tel:+33+6+76+89+16+52" style="text-decoration:none;color:rgb(68,68,68);font-family:Arial,Helvetica,sans-serif" target="_blank"> +33 6 76 89 16 52 
E [hidden email]
W www.datadome.co
  
Read DataDome reviews on G2
Reply | Threaded
Open this post in threaded view
|

Re: Finite source without blocking save-points

bupt_ljy
In reply to this post by Gaël Renoux



Oh the parallelism problem didn’t bother me because we used to set the parallelism of rule source to be one :o). Maybe a more elegant way is hashing the rule emitting by #RuntimeContext#getIndexOfThisSubtask.


Best,

Jiayi Liao


 Original Message 
Sender: Gaël Renoux<[hidden email]>
Recipient: bupt_ljy<[hidden email]>
Cc: user<[hidden email]>
Date: Monday, Nov 4, 2019 23:41
Subject: Re: Finite source without blocking save-points

Hi Jiayi,

This would allow me to call the Kafka producer without risking a race condition, but it comes with its own problem: unless the source has a parallelism of 1, it will trigger multiple times. I can create a specific source that doesn't produce anything, has a parallelism of 1, and calls the producer from its open method : it's a bit ugly, but it would get rid of the race condition.

On Mon, Nov 4, 2019 at 3:59 PM bupt_ljy <[hidden email]> wrote:

Hi Gael,

I had a similar situation before. Actually you don’t need to accomplish this in such a complicated way. I guess you’ve already had a rules source and you can send rules in #open function for a startup if your rules source inherit from #RichParallelSourceFunction.


Best,

Jiayi Liao


 Original Message 
Sender: Gaël Renoux<[hidden email]>
Recipient: user<[hidden email]>
Date: Monday, Nov 4, 2019 22:50
Subject: Finite source without blocking save-points

Hello everyone,

I have a job which runs continuously, but it also needs to send a single specific Kafka message on startup. I tried the obvious approach to use StreamExecutionEnvironment.fromElements and add a Kafka sink, however that's not possible: the source being finished, it becomes impossible to stop the job with a save-point later.

The best solution I found is creating a basic Kafka producer to send the message, and running that producer inside the job's startup script (before calling StreamExecutionEnvironment.execute()). However, there's a race condition, where the message could be sent and trigger stuff before the job is ready to receive messages. In addition, it forces me to have a separate Kafka producer, while Flink already comes with Kafka sinks. And finally, it's pretty specific to my use case (sending a Kafka message), and it looks like there should be a generic solution here.

Do you guys know of any better way to do this? Is there any way to set up a finite source that will not block save-points?

Just in case, the global use case is nothing special: my job maintains a set of rules as broadcast state in operators and handle input according to those rules. On startup, I need to request all rules to be sent at once (the emitter normally sends updated rules only), in case the rule state has been lost (happens when we evolve the rule model, for instance), and this is done through a Kafka message.

Thanks in advance!

Gaël Renoux


--
Gaël Renoux
Senior R&D Engineer, DataDome
M <a href="tel:+33+6+76+89+16+52" style="text-decoration:none;color:rgb(68,68,68);font-family:Arial,Helvetica,sans-serif" target="_blank"> +33 6 76 89 16 52 
E [hidden email]
W www.datadome.co
  
Read DataDome reviews on G2