I have a Kafka source that I would like to run a batch job on. Since version 1.12.0 soft-deprecates the DataSet API in favor of the DataStream API, can someone show me an example of this (using the DataStream API)?
thanks -- Robert Cullen 240-475-4490 |
Hi Robert, you basically just (re)write your application with the DataStream API, use the new KafkaSource, and then let the automatic batch-detection mode do its work [1]. The most important part is that all of your sources need to be bounded. Assuming that you just have a Kafka source, you need to use setBounded with the appropriate end offset/timestamp. Note that the rewritten Kafka source still has a couple of issues that should be addressed by the first bugfix release of 1.12 this month. So while it's safe to use for development, I'd wait for 1.12.1 to roll it out in production. If you have specific questions on the migration from DataSet to DataStream, please let me know. On Mon, Jan 4, 2021 at 7:34 PM Robert Cullen <[hidden email]> wrote:
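A minimal sketch of such a bounded KafkaSource, with placeholder broker, topic, and group names (builder method names follow the current Kafka connector documentation; some varied slightly across early 1.12.x releases):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")              // placeholder broker
        .setTopics("my-topic")                              // placeholder topic
        .setGroupId("my-batch-job")                         // placeholder group id
        .setStartingOffsets(OffsetsInitializer.earliest())
        // bounded: read up to the offsets present at job start, then finish
        .setBounded(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
```

With setBounded, the source reports itself as bounded, which is what the automatic batch-detection mode keys off.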
-- Arvid Heise | Senior Java Developer Follow us @VervericaData -- Join Flink Forward - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany -- Ververica GmbHRegistered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng |
Arvid, thanks. Can you show me an example of how the source is tied to the ExecutionEnvironment?
On Tue, Jan 5, 2021 at 7:28 AM Arvid Heise <[hidden email]> wrote:
Robert, here I modified your example with some highlights.
You can also explicitly set the execution mode, but that shouldn't be necessary (and may make things more complicated once you also want to execute the application in streaming mode).
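Assuming the explicit setting referred to here is the runtime execution mode from [1], a minimal sketch:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Forces batch execution explicitly; with all sources bounded, the default
// AUTOMATIC mode already selects batch, so this is usually redundant.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
```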
On Tue, Jan 5, 2021 at 4:51 PM Robert Cullen <[hidden email]> wrote:
Arvid, thank you. Sorry, my last post was not clear. The line attaching the source does not compile, since addSource expects a SourceFunction, not a KafkaSource. On Tue, Jan 5, 2021 at 11:16 AM Arvid Heise <[hidden email]> wrote:
Sorry, Robert, for not checking the complete example. New sources are used with fromSource instead of addSource. It's not ideal, but hopefully we can remove the old way rather soon to avoid confusion. On Tue, Jan 5, 2021 at 5:23 PM Robert Cullen <[hidden email]> wrote:
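A sketch of the corrected wiring, assuming `source` is the bounded KafkaSource built earlier:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Old API: env.addSource(...) takes a SourceFunction (e.g. FlinkKafkaConsumer).
// New API: env.fromSource(...) takes a Source such as the new KafkaSource.
DataStream<String> stream = env.fromSource(
        source,
        WatermarkStrategy.noWatermarks(),
        "Kafka Source");
```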
Arvid, I’m hoping to get your input on a process I’m working on. Originally I was using a streaming solution, but I noticed that the data in the sliding windows was getting too large over longer intervals and sometimes processing stopped altogether. Since the total counts should be a fixed number, a batch process would be more acceptable. The use case is this: get counts per key over 30 minutes of data, then take a 30-second time slice of the same data (possibly consecutive time slices), and run the results through one function. Originally my code looked like this, using sliding time windows in streaming mode:
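The original snippet did not survive the archive; a hypothetical reconstruction of a sliding-window count along the lines described (30-minute window, 30-second slide; the key extraction is a placeholder):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

stream                                                    // DataStream<String> from Kafka
    .map(line -> Tuple2.of(extractKey(line), 1L))         // extractKey is a placeholder
    .returns(Types.TUPLE(Types.STRING, Types.LONG))       // needed due to lambda type erasure
    .keyBy(t -> t.f0)
    .window(SlidingProcessingTimeWindows.of(Time.minutes(30), Time.seconds(30)))
    .sum(1);
```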
This does not work in batch mode. So I need some guidance. Thanks! On Tue, Jan 5, 2021 at 11:29 AM Arvid Heise <[hidden email]> wrote:
Hi Robert, The most reliable way to use batch mode in streaming is to use event time [1]. Processing-time or ingestion-time windows don't make much sense if you want to do some kind of reprocessing (nondeterministic results and resource usage, because the timestamps of records change with every execution). For windows to work in event time, you often need to define a watermark strategy [2]. Note that in your example you used the old source, which doesn't support batch execution mode. Here is a sketch of how I'd modify it:
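The sketch itself was lost from the archive; under the assumptions above it would look roughly like this, switching to the new bounded KafkaSource, event-time windows, and a per-partition monotonous-timestamp watermark strategy (key extraction remains a placeholder):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Bounded KafkaSource as before (setBounded(...)), attached with fromSource
// and a watermark strategy instead of the old FlinkKafkaConsumer/addSource.
env.fromSource(
        source,
        WatermarkStrategy.forMonotonousTimestamps(),      // per-partition increasing timestamps
        "Kafka Source")
    .map(line -> Tuple2.of(extractKey(line), 1L))         // extractKey is a placeholder
    .returns(Types.TUPLE(Types.STRING, Types.LONG))
    .keyBy(t -> t.f0)
    // event-time windows work in both batch and streaming execution
    .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.seconds(30)))
    .sum(1);
```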
Note that the specific watermark strategy depends on your data. I have chosen the most common strategy for Kafka, which assumes that timestamps are (non-strictly) increasing within each partition. If you have some out-of-order events, you probably need forBoundedOutOfOrderness. On Tue, Jan 5, 2021 at 10:21 PM Robert Cullen <[hidden email]> wrote: