Hi Team,

I want to use the Flink DataStream API for batch operations on a large amount of data. I tried to calculate the count and average over the whole DataStream without using a window function.

The approach I tried:

1> Read data from a table (say, the past 2 days of data) as a DataStream
2> Apply a key operation on the DataStream
3> Use a reduce function to compute count, sum and average

I wrote the output to a file and also inserted it into a table. Sample data from the file:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=1, fuel=7, avgFuel=0.0
vehicleId=dd, count=2, fuel=22, avgFuel=11.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

When there are multiple records with the same vehicle ID, only the final record has the correct values (see vehicleId=dd). Is there any way to get only one final record per vehicle, as shown below?

vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

I would also appreciate some help on how to sort the whole DataStream by one attribute. Say we have x records in one batch job; I would like to sort them and fetch the record at position x-2 per vehicle.

Regards,
Sunitha
Hi All, Request your inputs please. Regards, Sunitha
On Tuesday, October 27, 2020, 01:01:41 PM GMT+5:30, [hidden email] <[hidden email]> wrote:
> [quoted text: identical to the first message in this thread]
In SQL, you can use an OVER window to deduplicate the messages by ID [1], but I'm not sure whether the DataStream API has operators with the same semantics.
Hi Sunitha,

you probably need to apply a non-windowed grouping.

datastream

This example will always throw away the second record. You may want to combine the records, though, by summing up the fuel.

Best,
Arvid

On Wed, Oct 28, 2020 at 8:47 AM Danny Chan <[hidden email]> wrote:
--
Arvid Heise | Senior Java Developer
Follow us @VervericaData
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng
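
To make the "combine the records by summing up the fuel" suggestion concrete, here is a minimal sketch in plain Java. It does not use the Flink API: the class FinalPerVehicle, the record type FuelStats and the methods combine and finalPerVehicle are hypothetical names chosen for illustration, and the per-record fuel values for vehicle dd (7, 15, 20) are inferred from the running sums in the sample output. The combine method has the shape a reduce function body would have (two records of the same key in, one merged record out), and the map emulates a non-windowed grouping over a bounded input where only the last merged value per key is kept.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FinalPerVehicle {

    /** Hypothetical record type mirroring the fields of the sample output. */
    static final class FuelStats {
        final String vehicleId;
        final long count;
        final long fuel;
        final double avgFuel;

        FuelStats(String vehicleId, long count, long fuel) {
            this.vehicleId = vehicleId;
            this.count = count;
            this.fuel = fuel;
            // The average is derived in the constructor, so it is also set
            // for vehicles that only ever have a single record.
            this.avgFuel = (double) fuel / count;
        }

        @Override
        public String toString() {
            return "vehicleId=" + vehicleId + ", count=" + count
                    + ", fuel=" + fuel + ", avgFuel=" + avgFuel;
        }
    }

    /** Merges two partial results for the same vehicle (reduce-style). */
    static FuelStats combine(FuelStats a, FuelStats b) {
        return new FuelStats(a.vehicleId, a.count + b.count, a.fuel + b.fuel);
    }

    /** Keeps one running result per vehicle; intermediate values are overwritten. */
    static Map<String, FuelStats> finalPerVehicle(List<FuelStats> input) {
        Map<String, FuelStats> latest = new LinkedHashMap<>();
        for (FuelStats r : input) {
            latest.merge(r.vehicleId, r, FinalPerVehicle::combine);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Assumed raw input: one record per reading, each with count = 1.
        List<FuelStats> input = Arrays.asList(
                new FuelStats("aa", 1, 10),
                new FuelStats("dd", 1, 7),
                new FuelStats("dd", 1, 15),
                new FuelStats("dd", 1, 20),
                new FuelStats("ee", 1, 0));
        // The results are only printed after the whole input has been consumed.
        finalPerVehicle(input).values().forEach(System.out::println);
    }
}
```

The sketch prints one line per vehicle, with vehicleId=dd, count=3, fuel=42, avgFuel=14.0 for the vehicle that has three readings. Unlike the sample output in the question, aa is reported with avgFuel=10.0 rather than 0.0, because the average is computed when the record is constructed and not only when two records are merged. The key point is that a single final record per key only exists once the input is known to have ended; a keyed reduce on an unbounded stream has to emit an updated result for every incoming record, which matches the intermediate lines seen for dd.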