Working with bounded Datastreams - Flink 1.11.1

Working with bounded Datastreams - Flink 1.11.1

s_penakalapati@yahoo.com
Hi Team,

I want to use the Flink DataStream API for batch operations involving huge amounts of data. I tried to calculate the count and average over the whole DataStream without using a window function.

The approach I tried to calculate the count on the DataStream (a sketch follows the list):
1> Read data from a table (say the past 2 days of data) as a DataStream
2> Apply a keyBy operation on the DataStream
3> Use a reduce function to compute the count, sum and average.
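
A minimal, self-contained sketch of such a pipeline (the class name, the tuple layout and the hard-coded elements are illustrative stand-ins for the real table read):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class VehicleFuelAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for step 1>: the real job would read (vehicleId, fuel) records from the table here.
        env.fromElements(
                Tuple2.of("aa", 10), Tuple2.of("dd", 7), Tuple2.of("dd", 15),
                Tuple2.of("dd", 20), Tuple2.of("ee", 0))
            // Steps 2>/3>: key by vehicleId and keep a running (vehicleId, count, fuelSum, avgFuel).
            .map(r -> Tuple4.of(r.f0, 1L, (long) r.f1, 0.0))
            .returns(Types.TUPLE(Types.STRING, Types.LONG, Types.LONG, Types.DOUBLE))
            .keyBy(t -> t.f0)
            .reduce((a, b) -> Tuple4.of(
                a.f0,
                a.f1 + b.f1,
                a.f2 + b.f2,
                (a.f2 + b.f2) / (double) (a.f1 + b.f1)))
            // In streaming mode the reduce emits an updated record for every input record,
            // which is what produces the intermediate rows in the sample output below.
            .print();

        env.execute("vehicle-fuel-aggregation");
    }
}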

I have written the output to a file and also inserted it into a table; sample data from the file is:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=1, fuel=7, avgFuel=0.0
vehicleId=dd, count=2, fuel=22, avgFuel=11.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

What I am looking for: when there are multiple records with the same vehicleId, only the final record has the correct values (like vehicleId=dd). Is there any way to get only one final record for each vehicle, as shown below?
vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

I would also appreciate some help on how to sort a whole DataStream based on one attribute. Say we have X records in one batch job; I would like to sort them and fetch the record at position X-2 per vehicle.

Regards,
Sunitha.

Re: Working with bounded Datastreams - Flink 1.11.1

s_penakalapati@yahoo.com
Hi All,

Requesting your inputs, please.

Regards,
Sunitha


Re: Working with bounded Datastreams - Flink 1.11.1

Danny Chan-2
In SQL, you can use an over window to deduplicate the messages by id [1], but I'm not sure whether there are operators with the same semantics in the DataStream API.
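
A hypothetical sketch of such a deduplication query through the Table API, keeping only the latest row per vehicleId (the table name, the column names and the proc_time attribute are assumptions for illustration, not taken from the job above):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// "vehicle_aggregates" is assumed to be a registered table holding the per-vehicle results.
Table latestPerVehicle = tableEnv.sqlQuery(
    "SELECT vehicleId, cnt, fuel, avgFuel " +
    "FROM ( " +
    "  SELECT *, " +
    "         ROW_NUMBER() OVER (PARTITION BY vehicleId ORDER BY proc_time DESC) AS row_num " +
    "  FROM vehicle_aggregates " +
    ") " +
    "WHERE row_num = 1");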



Re: Working with bounded Datastreams - Flink 1.11.1

Arvid Heise-3
Hi Sunitha,

You probably need to apply a non-windowed grouping:

datastream
    .keyBy(Event::getVehicleId)
    .reduce((first, other) -> first); // keeps only the first record per vehicleId

This example will always throw away the second record. You may instead want to combine the records by summing up the fuel, as sketched below.
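
A minimal sketch of that combining variant (the statsStream variable, the VehicleStats type and its accessors are hypothetical stand-ins for whatever aggregate type the job actually uses):

// Hypothetical combining reduce: sum count and fuel per vehicleId and recompute the average,
// instead of discarding the second record.
statsStream
    .keyBy(VehicleStats::getVehicleId)
    .reduce((a, b) -> new VehicleStats(
        a.getVehicleId(),
        a.getCount() + b.getCount(),
        a.getFuel() + b.getFuel(),
        (a.getFuel() + b.getFuel()) / (double) (a.getCount() + b.getCount())));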

Best,

Arvid
