Hi, I have a question that I have not resolved via the documentation, looking in the "Parallel Execution", "Streaming" and the "Connectors" sections. If I retrieve a kafka stream and then call the process function against it in parallel, as follows, does it consume in some round robin fashion between the two process calls or is each element coming out of the kafka connector consumed by both processors in parallel? DataStream<SomeObject> inputStream = ... got a kafka stream inputStream.process( Function1 ); inputStream.process( Function2 ); If it possible to consume in parallel by pointing at the single stream, is it typically slower or faster than having two kafka streams with different group ids? If not documented elsewhere, this would be good to cover since it is fundamental. Thanks, Jason |
Hi Jason,
I'd suggest to start with [1] and [2] for getting the basics of a Flink program. The DataStream API basically wires operators together with streams so that whatever stream gets out of one operator is the input of the next. By connecting both functions to the same Kafka stream source, your program results in this: Kafka --> Function1 | --> Function2 where both functions receive all elements the previous stream offers (elements are broadcasted). If you want the two functions to work on different elements, you could add a filter before each function: DataStream<SomeObject> inputStream = ... got a kafka stream inputStream.filter(...).process( Function1 ); inputStream.filter(...).process( Function2 ); or split the stream (see [3] for available operators). I'm no expert on Kafka though, so I can't give you an advise on the performance - I'd suggest to create some small benchmarks for your setup since this probably depends on the cluster architecture and the parallelism of the operators and the number of Kafka partitions. Maybe Gordon (cc'd) can give some more insights. Regards Nico [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/concepts/programming-model.html [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/api_concepts.html [3] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/index.html On 12/01/18 21:57, Jason Kania wrote: > Hi, > > I have a question that I have not resolved via the documentation, > looking in the "Parallel Execution", "Streaming" and the "Connectors" > sections. If I retrieve a kafka stream and then call the process > function against it in parallel, as follows, does it consume in some > round robin fashion between the two process calls or is each element > coming out of the kafka connector consumed by both processors in parallel? > > DataStream<SomeObject> inputStream = ... got a kafka stream > inputStream.process( Function1 ); > inputStream.process( Function2 ); > > If it possible to consume in parallel by pointing at the single stream, > is it typically slower or faster than having two kafka streams with > different group ids? > > If not documented elsewhere, this would be good to cover since it is > fundamental. > > Thanks, > > Jason signature.asc (201 bytes) Download Attachment |
Just found a nice (but old) blog post that explains Flink's integration
with Kafka: https://data-artisans.com/blog/kafka-flink-a-practical-how-to I guess, the basics are still valid Nico On 16/01/18 11:17, Nico Kruber wrote: > Hi Jason, > I'd suggest to start with [1] and [2] for getting the basics of a Flink > program. > The DataStream API basically wires operators together with streams so > that whatever stream gets out of one operator is the input of the next. > By connecting both functions to the same Kafka stream source, your > program results in this: > > Kafka --> Function1 > | > --> Function2 > > where both functions receive all elements the previous stream offers > (elements are broadcasted). If you want the two functions to work on > different elements, you could add a filter before each function: > > DataStream<SomeObject> inputStream = ... got a kafka stream > inputStream.filter(...).process( Function1 ); > inputStream.filter(...).process( Function2 ); > > or split the stream (see [3] for available operators). > > I'm no expert on Kafka though, so I can't give you an advise on the > performance - I'd suggest to create some small benchmarks for your setup > since this probably depends on the cluster architecture and the > parallelism of the operators and the number of Kafka partitions. > Maybe Gordon (cc'd) can give some more insights. > > > Regards > Nico > > [1] > https://ci.apache.org/projects/flink/flink-docs-release-1.4/concepts/programming-model.html > [2] > https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/api_concepts.html > [3] > https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/index.html > > On 12/01/18 21:57, Jason Kania wrote: >> Hi, >> >> I have a question that I have not resolved via the documentation, >> looking in the "Parallel Execution", "Streaming" and the "Connectors" >> sections. If I retrieve a kafka stream and then call the process >> function against it in parallel, as follows, does it consume in some >> round robin fashion between the two process calls or is each element >> coming out of the kafka connector consumed by both processors in parallel? >> >> DataStream<SomeObject> inputStream = ... got a kafka stream >> inputStream.process( Function1 ); >> inputStream.process( Function2 ); >> >> If it possible to consume in parallel by pointing at the single stream, >> is it typically slower or faster than having two kafka streams with >> different group ids? >> >> If not documented elsewhere, this would be good to cover since it is >> fundamental. >> >> Thanks, >> >> Jason > signature.asc (201 bytes) Download Attachment |
Free forum by Nabble | Edit this page |