Parallel stream consumption


Jason Kania
Hi,

I have a question that I have not been able to resolve from the documentation, having looked in the "Parallel Execution", "Streaming" and "Connectors" sections. If I retrieve a Kafka stream and then call the process function against it twice, as follows, are the elements distributed in some round-robin fashion between the two process calls, or is each element coming out of the Kafka connector consumed by both processors in parallel?

DataStream<SomeObject> inputStream = ... got a kafka stream
inputStream.process( Function1 );
inputStream.process( Function2 );

If it is possible to consume in parallel by pointing at the single stream, is it typically slower or faster than having two Kafka streams with different group ids?

If not documented elsewhere, this would be good to cover since it is fundamental.

Thanks,

Jason
Re: Parallel stream consumption

Nico Kruber
Hi Jason,
I'd suggest starting with [1] and [2] to get the basics of a Flink
program.
The DataStream API basically wires operators together with streams, so
that whatever stream comes out of one operator is the input of the next.
By connecting both functions to the same Kafka stream source, your
program results in this:

Kafka --> Function1
      |
      --> Function2

where both functions receive all elements the previous stream offers
(every element is sent to both functions, not distributed round-robin
between them). If you want the two functions to work on different
elements, you could add a filter before each function:

DataStream<SomeObject> inputStream = ... got a kafka stream
inputStream.filter(...).process( Function1 );
inputStream.filter(...).process( Function2 );

or split the stream (see [3] for available operators).
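
To make the difference between the two wirings concrete, here is a minimal plain-Java model of them. It is not Flink code and does not use the Flink API: ModelStream, withoutFilters and withFilters are made-up names for illustration, and the even/odd predicates are only placeholders for whatever the real filters would test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Plain-Java model of one stream feeding several consumers; not Flink code. */
final class FanOutSketch {

    /** Hypothetical stand-in for a DataStream: remembers the consumers attached to it. */
    static final class ModelStream<T> {
        private final List<Consumer<T>> downstream = new ArrayList<>();

        /** Models inputStream.process(fn): attach one more consumer. */
        void process(Consumer<T> fn) {
            downstream.add(fn);
        }

        /** Models inputStream.filter(p): a derived stream that only sees matching elements. */
        ModelStream<T> filter(Predicate<T> keep) {
            ModelStream<T> filtered = new ModelStream<>();
            downstream.add(e -> {
                if (keep.test(e)) {
                    filtered.emit(e);
                }
            });
            return filtered;
        }

        /** Deliver one element to every attached consumer. */
        void emit(T element) {
            for (Consumer<T> c : downstream) {
                c.accept(element);
            }
        }
    }

    /** Two consumers attached directly to the source: each one sees every element. */
    static List<List<Integer>> withoutFilters(List<Integer> input) {
        List<Integer> seenBy1 = new ArrayList<>();
        List<Integer> seenBy2 = new ArrayList<>();
        ModelStream<Integer> source = new ModelStream<>();
        source.process(seenBy1::add);
        source.process(seenBy2::add);
        input.forEach(source::emit);
        return Arrays.asList(seenBy1, seenBy2);
    }

    /** A filter in front of each consumer: each one sees only its own share. */
    static List<List<Integer>> withFilters(List<Integer> input) {
        List<Integer> seenBy1 = new ArrayList<>();
        List<Integer> seenBy2 = new ArrayList<>();
        ModelStream<Integer> source = new ModelStream<>();
        source.filter(x -> x % 2 == 0).process(seenBy1::add); // placeholder predicate
        source.filter(x -> x % 2 != 0).process(seenBy2::add); // placeholder predicate
        input.forEach(source::emit);
        return Arrays.asList(seenBy1, seenBy2);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(withoutFilters(input)); // [[1, 2, 3, 4], [1, 2, 3, 4]]
        System.out.println(withFilters(input));    // [[2, 4], [1, 3]]
    }
}
```

Note that in this model the source still emits every element once per attached branch, and the filters simply discard what a branch does not want. It says nothing about how a real cluster schedules or ships the data.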

I'm no expert on Kafka though, so I can't give you any advice on
performance. I'd suggest creating some small benchmarks for your setup,
since this probably depends on the cluster architecture, the
parallelism of the operators, and the number of Kafka partitions.
Maybe Gordon (cc'd) can give some more insights.


Regards
Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/concepts/programming-model.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/api_concepts.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/index.html

(In reply to Jason Kania's message of 12/01/18 21:57.)

Re: Parallel stream consumption

Nico Kruber
I just found a nice (but old) blog post that explains Flink's integration
with Kafka:
https://data-artisans.com/blog/kafka-flink-a-practical-how-to

I guess the basics are still valid.


Nico

(In reply to Nico Kruber's message of 16/01/18 11:17.)