Data exchange between tasks (operators/sources) at streaming API runtime


Data exchange between tasks (operators/sources) at streaming API runtime

Ishwara Varnasi
Hello,
I have a scenario with two sources: the first produces a fixed list of ids used for preloading (caching certain information, which is slow), and the second is a Kafka consumer. I need the Kafka source to start only after the first source completes, so I need a mechanism to let the Kafka consumer know that it can start consuming messages. How can I achieve this?
Thanks,
Ishwara Varnasi
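
For context, a minimal sketch of the topology being described (a hypothetical reconstruction; the topic name, Kafka connector version, and element types are invented):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class TwoSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source 1: a fixed list of ids used to preload (cache) some slow-to-fetch info.
        DataStream<String> ids = env.fromElements("id-1", "id-2", "id-3");

        // Source 2: the Kafka consumer. The requirement is that it must not start
        // consuming until the id source above has completed.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "preload-demo");
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props));

        ids.print();     // preload path
        events.print();  // event path -- nothing here enforces the desired ordering yet

        env.execute("two-source-job");
    }
}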

Re: Data exchange between tasks (operators/sources) at streaming API runtime

Piotr Nowojski
Hi,

As far as I know, there is currently no simple way to do this.

One workaround might be to buffer the Kafka input in state inside your TwoInput operator until all of the broadcast messages have arrived (a sketch of this follows after this message).
Another option might be to start your application dynamically: first run some computation to determine the fixed list of ids, then start the Flink application with those values hard-coded in or passed via command-line arguments.

Piotrek 
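
For illustration, a minimal sketch of that buffering workaround, assuming the id stream is broadcast and connected with the Kafka stream into a single two-input operator, and assuming (an invented convention) that the id source emits a final "DONE" element when preloading input is complete:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Buffers Kafka elements (input 2) until the broadcast id input (input 1)
// signals completion, then flushes the buffer and passes elements through.
public class GatedCoFlatMap extends RichCoFlatMapFunction<String, String, String>
        implements CheckpointedFunction {

    private boolean idsDone = false;
    private final List<String> buffered = new ArrayList<>();
    private transient ListState<String> checkpointedBuffer;

    @Override
    public void flatMap1(String id, Collector<String> out) {
        if ("DONE".equals(id)) {
            idsDone = true;
            for (String event : buffered) {
                out.collect(event);  // flush everything held back during preloading
            }
            buffered.clear();
        } else {
            // ... preload/cache the slow-to-fetch info for this id here ...
        }
    }

    @Override
    public void flatMap2(String event, Collector<String> out) {
        if (idsDone) {
            out.collect(event);
        } else {
            buffered.add(event);  // hold Kafka data back until preloading finishes
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedBuffer.clear();
        for (String event : buffered) {
            checkpointedBuffer.add(event);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedBuffer = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-events", String.class));
        for (String event : checkpointedBuffer.get()) {
            buffered.add(event);
        }
    }
}

Wired up as, for example, ids.broadcast().connect(events).flatMap(new GatedCoFlatMap()). Note that idsDone is deliberately not snapshotted: on restore, the restarted id source would replay its elements, including the final "DONE".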


Re: Data exchange between tasks (operators/sources) at streaming API runtime

Ishwara Varnasi
FLIP-17 (side inputs) is promising. Until it's available, I'm planning the following: extend the Kafka consumer and add logic to hold off consuming until the other source (the fixed set) has finished sending and those messages have been processed by the application. The question, however, is how to let the Kafka consumer know that it should now start consuming. What is the correct way to broadcast messages to other tasks at runtime? I had success with the distributed cache (i.e., one task writes a status to a file and the other polls that file for the status), but while it works, it doesn't look like a good solution.
Thanks for the pointers.
Ishwara Varnasi 
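
For what it's worth, a rough sketch of that "extend the Kafka consumer and hold it back" idea, gated on the marker file described above (the marker path, poll interval, and connector version are invented; the preload side would create the file once it finishes):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

// A Kafka consumer that busy-waits on a marker file before it starts consuming.
public class GatedKafkaConsumer extends FlinkKafkaConsumer011<String> {

    private final String markerPath;

    public GatedKafkaConsumer(String topic, Properties props, String markerPath) {
        super(topic, new SimpleStringSchema(), props);
        this.markerPath = markerPath;
    }

    @Override
    public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
        Path marker = new Path(markerPath);
        FileSystem fs = marker.getFileSystem();
        // Poll until the preloading side has written the marker file.
        while (!fs.exists(marker)) {
            Thread.sleep(1000L);  // arbitrary poll interval
        }
        super.run(ctx);  // marker present: consume from Kafka as usual
    }
}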


Re: Data exchange between tasks (operators/sources) at streaming API runtime

Piotr Nowojski
If you want to go this way, you could:
- as you proposed, busy-wait on a file in a distributed file system
- wait for a network message (opening your own socket)
- use some other external system for this purpose: Kafka? ZooKeeper?

All of these seem hacky, though; I would prefer (as I proposed before) to pre-compute those ids before running/starting the main Flink application (see the sketch after this message). That would probably be simpler and easier to maintain.

Piotrek
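
For completeness, a rough sketch of the pre-compute approach Piotrek prefers: resolve the ids before submitting the job, pass them in as arguments, and do the slow preload in open(), so each subtask finishes warming its cache before processing any Kafka record (the topic, argument name, and in-memory "cache" are invented):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class PrecomputedIdsJob {
    public static void main(String[] args) throws Exception {
        // Launched as e.g.: flink run job.jar --ids id-1,id-2,id-3
        // where the id list was computed in a separate step before submission.
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String[] ids = params.getRequired("ids").split(",");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "precomputed-demo");

        env.addSource(new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props))
           .map(new RichMapFunction<String, String>() {
               private transient Set<String> cache;

               @Override
               public void open(Configuration parameters) {
                   // The slow preload happens here; open() completes before
                   // this subtask processes its first Kafka record.
                   cache = new HashSet<>(Arrays.asList(ids));
               }

               @Override
               public String map(String event) {
                   return cache.contains(event) ? event + " [cached]" : event;
               }
           })
           .print();

        env.execute("precomputed-ids-job");
    }
}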


Re: Data exchange between tasks (operators/sources) at streaming API runtime

Ishwara Varnasi
Yes, that makes sense; I'll consider one of those better options. Thanks!
Ishwara
