Data exchange between tasks (operators/sources) at streaming API runtime


Data exchange between tasks (operators/sources) at streaming API runtime

Ishwara Varnasi
Hello,
I have a scenario with two sources: the first produces a fixed list of ids used for preloading (caching certain information, which is slow), and the second is a Kafka consumer. I need the Kafka source to start only after the first source completes, so I need a mechanism to let the Kafka consumer know that it can start consuming messages. How can I achieve this?
Thanks,
Ishwara Varnasi
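
For context, a minimal sketch of the topology being described (a hypothetical reconstruction; the topic name, Kafka connector version, and element types are invented):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class TwoSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source 1: a fixed list of ids used to preload (cache) some slow-to-fetch info.
        DataStream<String> ids = env.fromElements("id-1", "id-2", "id-3");

        // Source 2: the Kafka consumer. The requirement is that it must not start
        // consuming until the id source above has completed.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "preload-demo");
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props));

        ids.print();     // preload path
        events.print();  // event path -- nothing here enforces the desired ordering yet

        env.execute("two-source-job");
    }
}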

Re: Data exchange between tasks (operators/sources) at streaming API runtime

Piotr Nowojski
Hi,

As far as I know, there is currently no simple way to do this.

One workaround might be to buffer the Kafka input in state inside your TwoInput operator until all of the broadcast messages have arrived (a sketch of this follows after this message).
Another option might be to start your application dynamically: first run some computation to determine the fixed list of ids, then start the Flink application with those values hard-coded in or passed via command-line arguments.

Piotrek 
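
For illustration, a minimal sketch of that buffering workaround, assuming the id stream is broadcast and connected with the Kafka stream into a single two-input operator, and assuming (an invented convention) that the id source emits a final "DONE" element when preloading input is complete:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Buffers Kafka elements (input 2) until the broadcast id input (input 1)
// signals completion, then flushes the buffer and passes elements through.
public class GatedCoFlatMap extends RichCoFlatMapFunction<String, String, String>
        implements CheckpointedFunction {

    private boolean idsDone = false;
    private final List<String> buffered = new ArrayList<>();
    private transient ListState<String> checkpointedBuffer;

    @Override
    public void flatMap1(String id, Collector<String> out) {
        if ("DONE".equals(id)) {
            idsDone = true;
            for (String event : buffered) {
                out.collect(event);  // flush everything held back during preloading
            }
            buffered.clear();
        } else {
            // ... preload/cache the slow-to-fetch info for this id here ...
        }
    }

    @Override
    public void flatMap2(String event, Collector<String> out) {
        if (idsDone) {
            out.collect(event);
        } else {
            buffered.add(event);  // hold Kafka data back until preloading finishes
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedBuffer.clear();
        for (String event : buffered) {
            checkpointedBuffer.add(event);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedBuffer = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-events", String.class));
        for (String event : checkpointedBuffer.get()) {
            buffered.add(event);
        }
    }
}

Wired up as, for example, ids.broadcast().connect(events).flatMap(new GatedCoFlatMap()). Note that idsDone is deliberately not snapshotted: on restore, the restarted id source would replay its elements, including the final "DONE".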


Re: Data exchange between tasks (operators/sources) at streaming API runtime

Ishwara Varnasi
FLIP-17 (side inputs) is promising. Until it's available, I'm planning the following: extend the Kafka consumer and add logic to hold off consuming until the other source (the fixed set) has finished sending and those messages have been processed by the application. The question, however, is how to let the Kafka consumer know that it should now start consuming. What is the correct way to broadcast messages to other tasks at runtime? I had success with the distributed cache (i.e., one task writes a status to a file and the other polls that file for the status), but while it works, it doesn't look like a good solution.
Thanks for the pointers.
Ishwara Varnasi 
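
For what it's worth, a rough sketch of that "extend the Kafka consumer and hold it back" idea, gated on the marker file described above (the marker path, poll interval, and connector version are invented; the preload side would create the file once it finishes):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

// A Kafka consumer that busy-waits on a marker file before it starts consuming.
public class GatedKafkaConsumer extends FlinkKafkaConsumer011<String> {

    private final String markerPath;

    public GatedKafkaConsumer(String topic, Properties props, String markerPath) {
        super(topic, new SimpleStringSchema(), props);
        this.markerPath = markerPath;
    }

    @Override
    public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
        Path marker = new Path(markerPath);
        FileSystem fs = marker.getFileSystem();
        // Poll until the preloading side has written the marker file.
        while (!fs.exists(marker)) {
            Thread.sleep(1000L);  // arbitrary poll interval
        }
        super.run(ctx);  // marker present: consume from Kafka as usual
    }
}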


Re: Data exchange between tasks (operators/sources) at streaming API runtime

Piotr Nowojski
If you want to go this way, you could:
- as you proposed, busy-wait on a file in a distributed file system
- wait for a network message (opening your own socket)
- use some other external system for this purpose: Kafka? ZooKeeper?

All of these seem hacky, though; I would prefer (as I proposed before) to pre-compute those ids before running/starting the main Flink application (see the sketch after this message). That would probably be simpler and easier to maintain.

Piotrek
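
For completeness, a rough sketch of the pre-compute approach Piotrek prefers: resolve the ids before submitting the job, pass them in as arguments, and do the slow preload in open(), so each subtask finishes warming its cache before processing any Kafka record (the topic, argument name, and in-memory "cache" are invented):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class PrecomputedIdsJob {
    public static void main(String[] args) throws Exception {
        // Launched as e.g.: flink run job.jar --ids id-1,id-2,id-3
        // where the id list was computed in a separate step before submission.
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String[] ids = params.getRequired("ids").split(",");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "precomputed-demo");

        env.addSource(new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props))
           .map(new RichMapFunction<String, String>() {
               private transient Set<String> cache;

               @Override
               public void open(Configuration parameters) {
                   // The slow preload happens here; open() completes before
                   // this subtask processes its first Kafka record.
                   cache = new HashSet<>(Arrays.asList(ids));
               }

               @Override
               public String map(String event) {
                   return cache.contains(event) ? event + " [cached]" : event;
               }
           })
           .print();

        env.execute("precomputed-ids-job");
    }
}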


Re: Data exchange between tasks (operators/sources) at streaming API runtime

Ishwara Varnasi
Yes, that makes sense; I'll consider one of those better options. Thanks!
Ishwara
