Hello people, I've written down some quick questions for which I couldn't find much (or anything) in the documentation. I hope you can answer some of them!

# Multiple consumers

1. Is it possible to .union() streams of different classes? This would be useful, for example, to create a consumer that counts elements across different topics, keyed by the class name of the element, with a tumbling window of, say, 5 minutes.

2. If #1 is not possible, I need to launch multiple consumers to achieve the same effect. However, I get a "Factory already initialized" error when I run environment.execute() for two consumers on different threads. How do you .execute() more than one consumer in the same application?

# Custom triggers

3. If a custom .trigger() overrides the trigger of the WindowAssigner used previously, why do we have to specify a WindowAssigner (such as TumblingProcessingTimeWindows) at all in order to specify a custom trigger? Shouldn't it be possible to pass a trigger directly to .window()?

4. I need a stream with a CountWindow (say, size 10 and slide 1) that may take more than 10 hours to fill for the first time, but in the meantime I want to process whatever elements have already been generated. I guess the way to do this is to create a custom trigger that fires on every new element, with up to 10 elements at a time. The result would be windows of sizes 1, then 2, 3, ..., 9, 10, 10, 10, and so on. Is there a way to achieve this with the predefined triggers, or is a custom trigger the only way to go here?

Best regards,
Matt
For #1 there are a couple of ways to do this. The easiest is probably stream1.connect(stream2).map(...), where the CoMapFunction maps the two input types to a common type that you can then process uniformly.

For #3: a WindowAssigner must always be specified. There are some convenient shortcuts for this in the API, such as timeWindow() or window(TumblingProcessingTimeWindows.of(...)), but you always have to provide one, whether or not you supply your own trigger implementation. Using window(...) with a custom trigger is just:

stream.keyBy(...).window(...).trigger(...).apply(...)

or something similar. Not sure if I answered your question though.

For #4: if I understand you correctly, that is exactly what CountWindow(10, 1) does already. For example, if your input data were a sequence of integers starting with 0, the output would be:

(0)
(0, 1)
(0, 1, 2)
(0, 1, 2, 3)
(0, 1, 2, 3, 4)
(0, 1, 2, 3, 4, 5)
(0, 1, 2, 3, 4, 5, 6)
(0, 1, 2, 3, 4, 5, 6, 7)
(0, 1, 2, 3, 4, 5, 6, 7, 8)
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
... etc.

-Jamie

On Wed, Dec 14, 2016 at 9:17 AM, Matt <[hidden email]> wrote:
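The window sequence above can be checked with a small stand-alone simulation. This is not Flink code, just a plain-Java model of the sliding-count semantics described here (a buffer capped at the window size that fires a snapshot on every incoming element); the class and method names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SlidingCountWindowDemo {

    // Models countWindow(size, 1): a window fires on every new element
    // and contains at most `size` of the most recent elements.
    static List<List<Integer>> slidingCountWindows(List<Integer> input, int size) {
        Deque<Integer> buffer = new ArrayDeque<>();
        List<List<Integer>> fired = new ArrayList<>();
        for (int value : input) {
            buffer.addLast(value);
            if (buffer.size() > size) {
                buffer.removeFirst(); // evict the oldest element
            }
            fired.add(new ArrayList<>(buffer)); // fire with the current contents
        }
        return fired;
    }
}
```

Feeding it the integers 0..13 with size 10 reproduces the list above: the first ten windows grow from 1 to 10 elements, and from then on the window slides by one element at a time.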
Hey Jamie,

Ok with #1. I guess #2 is just not possible then.

I got it about #3. I just checked the code for the tumbling window assigner, and I noticed that only its default trigger gets overridden when using a custom trigger, not the way it assigns windows. It makes sense now.

Regarding #4, after doing some more tests I think it's more complex than I first thought. I'll probably create another thread explaining that specific question in more detail.

Thanks,
Matt

On Wed, Dec 14, 2016 at 2:52 PM, Jamie Grier <[hidden email]> wrote:
Ahh, sorry, for #2: a single Flink job can have as many sources as you like. They can be combined in multiple ways, via things like joins, connect(), etc. They can also be completely independent; in other words, the data flow graph can be completely disjoint. You never need to call execute() more than once. Just define your program with as many sources as you want, and then call execute().
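As a rough sketch of that pattern (schematic only, not compilable as-is: SourceA, SourceB, and the sinks are hypothetical placeholders, while StreamExecutionEnvironment, addSource(), addSink(), and execute() are the real Flink 1.x DataStream API entry points), a single job with two disjoint pipelines looks like:

```
// One Flink job, two unrelated pipelines, a single execute() call.
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(new SourceA())      // pipeline 1
   .map(/* ... */)
   .addSink(new SinkA());

env.addSource(new SourceB())      // pipeline 2, disjoint from pipeline 1
   .map(/* ... */)
   .addSink(new SinkB());

env.execute("one-job-many-sources");  // called exactly once
```

Both pipelines are part of the same job graph, so the single execute() call submits them together, which avoids the "Factory already initialized" error from calling execute() on multiple threads.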
On Wed, Dec 14, 2016 at 4:02 PM, Matt <[hidden email]> wrote:
Got it. Thanks!
1. If we have multiple sources, can the streams be parallelized?
2. Can we have multiple sinks as well?

On Dec 14, 2016 10:46 PM, <[hidden email]> wrote:
All streams can be parallelized in Flink even with only one source. You can have multiple sinks as well. On Thu, Dec 15, 2016 at 7:56 AM, Meghashyam Sandeep V <[hidden email]> wrote:
To be more clear: a single source in a Flink program is a logical concept. Flink jobs run with some level of parallelism, meaning that multiple copies of your source (and all other) functions run distributed across a cluster. So if you have a streaming program with two sources and you run with a parallelism of 8, there are actually a total of 16 source functions executing on the cluster: 8 instances of each of the two source operators you've defined in your Flink job.

For more info on this you may want to read through the following:

https://ci.apache.org/projects/flink/flink-docs-release-1.1/concepts/concepts.html

On Thu, Dec 15, 2016 at 3:21 PM, Jamie Grier <[hidden email]> wrote:
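The instance count in that example is just the number of source operators times the parallelism; a trivial helper makes the arithmetic explicit (illustrative only, not a Flink API):

```java
public class ParallelismMath {
    // Each logical operator in a Flink job is replicated `parallelism`
    // times at runtime, so a job with N source operators runs
    // N * parallelism physical source instances.
    static int totalSourceInstances(int numSourceOperators, int parallelism) {
        return numSourceOperators * parallelism;
    }
}
```

With the numbers from the example, totalSourceInstances(2, 8) gives the 16 running source functions mentioned above.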
|