(DEPRECATED) Apache Flink User Mailing List archive.

Working with bounded Datastreams - Flink 1.11.1

Classic

List

Threaded

4 messages Options

s_penakalapati@yahoo.com

Working with bounded Datastreams - Flink 1.11.1

Hi Team,

I want to use Flink Datastream for Batch operations which involves huge data, I did try to calculate count and average on the whole Datastream with out using window function.

Approach I tried to calculate count on the datastream:

1> Read data from table (say past 2 days of data) as Datastream

2> apply Key operation on the datastream

3> then use reduce function to find count, sum and average.

I have written output to file and also inserted into table: sample data from file is:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0

vehicleId=dd, count=1, fuel=7, avgFuel=0.0

vehicleId=dd, count=2, fuel=22, avgFuel=11.0

vehicleId=dd, count=3, fuel=42, avgFuel=14.0

vehicleId=ee, count=1, fuel=0, avgFuel=0.0

what I am looking for is , when there are multiple records with same vehicle Id I see that only the final record is having correct values (like vehicleId=dd). Is there any way to get only one final record for each vehicle as shown below:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0

vehicleId=dd, count=3, fuel=42, avgFuel=14.0

vehicleId=ee, count=1, fuel=0, avgFuel=0.0

Also I request some help on how to sort whole DataStream based on one attribute. Say we have x records in one Batch Job I would like to sort and fetch X-2 position record per vehicle.

Regards,

Sunitha.

s_penakalapati@yahoo.com

Re: Working with bounded Datastreams - Flink 1.11.1

Hi All,

Request your inputs please.

Regards,

Sunitha

On Tuesday, October 27, 2020, 01:01:41 PM GMT+5:30, [hidden email] <[hidden email]> wrote:

Hi Team,

I want to use Flink Datastream for Batch operations which involves huge data, I did try to calculate count and average on the whole Datastream with out using window function.

Approach I tried to calculate count on the datastream:

1> Read data from table (say past 2 days of data) as Datastream

2> apply Key operation on the datastream

3> then use reduce function to find count, sum and average.

I have written output to file and also inserted into table: sample data from file is:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0

vehicleId=dd, count=1, fuel=7, avgFuel=0.0

vehicleId=dd, count=2, fuel=22, avgFuel=11.0

vehicleId=dd, count=3, fuel=42, avgFuel=14.0

vehicleId=ee, count=1, fuel=0, avgFuel=0.0

vehicleId=aa, count=1, fuel=10, avgFuel=0.0

vehicleId=dd, count=3, fuel=42, avgFuel=14.0

vehicleId=ee, count=1, fuel=0, avgFuel=0.0

Also I request some help on how to sort whole DataStream based on one attribute. Say we have x records in one Batch Job I would like to sort and fetch X-2 position record per vehicle.

Regards,

Sunitha.

Danny Chan-2

Re: Working with bounded Datastreams - Flink 1.11.1

In SQL, you can use the over window to deduplicate the messages by the id [1], but i'm not sure if there are same semantic operators in DataStream.

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#deduplication

[hidden email] <[hidden email]> 于2020年10月28日周三下午12:34写道：

Hi All,

Request your inputs please.

Regards,
Sunitha

On Tuesday, October 27, 2020, 01:01:41 PM GMT+5:30, [hidden email] <[hidden email]> wrote:

Hi Team,

I want to use Flink Datastream for Batch operations which involves huge data, I did try to calculate count and average on the whole Datastream with out using window function.

Approach I tried to calculate count on the datastream:
1> Read data from table (say past 2 days of data) as Datastream
2> apply Key operation on the datastream
3> then use reduce function to find count, sum and average.

I have written output to file and also inserted into table: sample data from file is:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=1, fuel=7, avgFuel=0.0
vehicleId=dd, count=2, fuel=22, avgFuel=11.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

what I am looking for is , when there are multiple records with same vehicle Id I see that only the final record is having correct values (like vehicleId=dd). Is there any way to get only one final record for each vehicle as shown below:
vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

Also I request some help on how to sort whole DataStream based on one attribute. Say we have x records in one Batch Job I would like to sort and fetch X-2 position record per vehicle.

Regards,
Sunitha.

Arvid Heise-3

Re: Working with bounded Datastreams - Flink 1.11.1

Hi Sunitha,

you probably need to apply a non-windowed grouping.

datastream
   .keyBy(Event::getVehicleId)
   .reduce((first, other) -> first);

This example will always throw away the second record. You may want to combine the records though by summing up the fuel.

Best,

Arvid

On Wed, Oct 28, 2020 at 8:47 AM Danny Chan <[hidden email]> wrote:

In SQL, you can use the over window to deduplicate the messages by the id [1], but i'm not sure if there are same semantic operators in DataStream.

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#deduplication

[hidden email] <[hidden email]> 于2020年10月28日周三下午12:34写道：

Hi All,

Request your inputs please.

Regards,
Sunitha

On Tuesday, October 27, 2020, 01:01:41 PM GMT+5:30, [hidden email] <[hidden email]> wrote:

Hi Team,

I want to use Flink Datastream for Batch operations which involves huge data, I did try to calculate count and average on the whole Datastream with out using window function.

Approach I tried to calculate count on the datastream:
1> Read data from table (say past 2 days of data) as Datastream
2> apply Key operation on the datastream
3> then use reduce function to find count, sum and average.

I have written output to file and also inserted into table: sample data from file is:

vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=1, fuel=7, avgFuel=0.0
vehicleId=dd, count=2, fuel=22, avgFuel=11.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

what I am looking for is , when there are multiple records with same vehicle Id I see that only the final record is having correct values (like vehicleId=dd). Is there any way to get only one final record for each vehicle as shown below:
vehicleId=aa, count=1, fuel=10, avgFuel=0.0
vehicleId=dd, count=3, fuel=42, avgFuel=14.0
vehicleId=ee, count=1, fuel=0, avgFuel=0.0

Also I request some help on how to sort whole DataStream based on one attribute. Say we have x records in one Batch Job I would like to sort and fetch X-2 position record per vehicle.

Regards,
Sunitha.

Arvid Heise | Senior Java Developer

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng