Hi all, I am not sure when I should go for multiple jobs or have 1 job with all the sources and sinks. Following is my code. val env = StreamExecutionEnvironment.getExecutionEnvironment ....... // create a Kafka source val srcstream = env.addSource(consumer) srcstream .keyBy(0) .window(ProcessingTimeSessionWindows.withGap(Time.days(14))) .reduce ... .map ... .addSink ... srcstream .keyBy(0) .window(ProcessingTimeSessionWindows.withGap(Time.days(28))) .reduce ... .map ... .addSink ... env.execute("Job1") My questions 1. The srcstream is a very high volume stream and the window size is 2 weeks and 4 weeks. Is the window size a problem? In this case, I think it is not a problem because I am using reduce which stores only 1 value per window. Is that right? 2. I am having 2 output operations one with 2 weeks window and the other with 4 weeks window. Are they executed in parallel or in sequence? 3. When I have multiple output operations like in this case should I break it into 2 different jobs ? 4. Can I run multiple jobs on the same cluster? Thanks |
Hi anna, 1. The srcstream is a very high volume stream and the window size is 2 weeks and 4 weeks. Is the window size a problem? In this case, I think it is not a problem because I am using reduce which stores only 1 value per window. Is that right? >> Window Size is based on your business needs settings. However, if the window size is too large, the status of the job will be large, which will result in a longer recovery failure. You need to be aware of this. One value per window is just a value calculated by the window. It caches all data for the period of time before the window is triggered. 2. I am having 2 output operations one with 2 weeks window and the other with 4 weeks window. Are they executed in parallel or in sequence? >> These two windows are calculated in parallel. 3. When I have multiple output operations like in this case should I break it into 2 different jobs ? >> Both modes are ok. When there is only one job, the two windows will share the source stream, but this will result in a larger state of the job and a slower recovery. When split into two jobs, there will be two consumptions of kafka, but the two windows are independent in both jobs. 4. Can I run multiple jobs on the same cluster? >> For Standalone cluster mode or Yarn Flink Session mode, etc., there is no problem. For Flink on yarn single job mode, a cluster can usually only run one job, which is the recommended mode. Thanks, vino. 2018-07-31 15:11 GMT+08:00 anna stax <[hidden email]>:
|
> 在 2018年7月31日,15:47,vino yang <[hidden email]> 写道: > > Hi anna, > > 1. The srcstream is a very high volume stream and the window size is 2 weeks and 4 weeks. Is the window size a problem? In this case, I think it is not a problem because I am using reduce which stores only 1 value per window. Is that right? > > >> Window Size is based on your business needs settings. However, if the window size is too large, the status of the job will be large, which will result in a longer recovery failure. You need to be aware of this. One value per window is just a value calculated by the window. It caches all data for the period of time before the window is triggered. > > 2. I am having 2 output operations one with 2 weeks window and the other with 4 weeks window. Are they executed in parallel or in sequence? > > >> These two windows are calculated in parallel. > > 3. When I have multiple output operations like in this case should I break it into 2 different jobs ? > > >> Both modes are ok. When there is only one job, the two windows will share the source stream, but this will result in a larger state of the job and a slower recovery. When split into two jobs, there will be two consumptions of kafka, but the two windows are independent in both jobs. > > 4. Can I run multiple jobs on the same cluster? > > >> For Standalone cluster mode or Yarn Flink Session mode, etc., there is no problem. For Flink on yarn single job mode, a cluster can usually only run one job, which is the recommended mode. > > Thanks, vino. > > 2018-07-31 15:11 GMT+08:00 anna stax <[hidden email]>: > Hi all, > > I am not sure when I should go for multiple jobs or have 1 job with all the sources and sinks. Following is my code. > > val env = StreamExecutionEnvironment.getExecutionEnvironment > ....... > // create a Kafka source > val srcstream = env.addSource(consumer) > > srcstream > .keyBy(0) > .window(ProcessingTimeSessionWindows.withGap(Time.days(14))) > .reduce ... > .map ... > .addSink ... > > srcstream > .keyBy(0) > .window(ProcessingTimeSessionWindows.withGap(Time.days(28))) > .reduce ... > .map ... > .addSink ... > > env.execute("Job1") > > My questions > > 1. The srcstream is a very high volume stream and the window size is 2 weeks and 4 weeks. Is the window size a problem? In this case, I think it is not a problem because I am using reduce which stores only 1 value per window. Is that right? > > 2. I am having 2 output operations one with 2 weeks window and the other with 4 weeks window. Are they executed in parallel or in sequence? > > 3. When I have multiple output operations like in this case should I break it into 2 different jobs ? > > 4. Can I run multiple jobs on the same cluster? > > Thanks > > > Hi yang, I’m a bit confused about the window data cache that you mentioned. > It caches all data for the period of time before the window is triggered. In my understanding, window functions process elements incrementally unless the low level API ProcessWindowFunction was used, so caching data should not be required in most scenarios. Would you mind giving more details of the window caching design? And please correct me if I’m wrong. Thanks a lot. Best regards, Paul Lam |
Hi Paul, Yes, I am talking about the normal case, Flink must store the data in the window as a state to prevent failure. In some scenarios your understanding is also correct, and flink uses the window pane to optimize window calculations. So, if your scene is in optimized mode, ignore this. Thanks, vino. 2018-08-02 16:11 GMT+08:00 Paul Lam <[hidden email]>:
|
Hi, Paul is right. Which and how much data is stored in state for a window depends on the type of the function that is applied on the windows: - ReduceFunction: Only the reduced value is stored - AggregateFunction: Only the accumulator value is stored - WindowFunction or ProcessWindowFunction: All original records are stored. So in Anna's jobs, each window only stores a single value. Hence, the state size is independent of the size of the window (unless, the reduced value collects values of all input records, e.g., in a list or set). Best, Fabian 2018-08-02 10:29 GMT+02:00 vino yang <[hidden email]>:
|
Free forum by Nabble | Edit this page |