Latency with cross operation on Datasets

Varun Dhore
Hello Flink community,

I am trying to understand the latency involved in the cross operation. Below
are my tests.

In plain Java:
1. Create 2D array 1, populated with 1 million rows and 3 columns of
randomly generated double values.
2. Create 2D array 2, populated with 100 rows and 3 columns of randomly
generated double values.
3. Run a nested for loop over 1 million x 100 pairs and perform the
EuclideanDistance calculation inside the inner loop.
4. Collect the output in a List of doubles and print the size of the list
at the end.

The above steps complete in about 15 seconds in plain Java on my laptop.
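
For reference, the plain-Java steps above can be sketched roughly as
follows. This is a minimal reconstruction, not the code that was actually
timed: the class and method names (NestedLoopDistance, randomMatrix,
euclideanDistance, sumOfDistances) and the fixed seed are assumptions, and
it keeps a running sum instead of the List of doubles from step 4 so that
it fits in a default-sized heap.

```java
import java.util.Random;

public class NestedLoopDistance {

    // Fill a rows x cols array with random doubles in [0, 1).
    static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int c = 0; c < cols; c++) {
                row[c] = rnd.nextDouble();
            }
        }
        return m;
    }

    // Euclidean distance between two points of equal dimension.
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Visit every (large row, small row) pair and add up the distances.
    // Assumption: a running sum replaces the List of doubles to save memory.
    static double sumOfDistances(double[][] large, double[][] small) {
        double total = 0.0;
        for (double[] p : large) {
            for (double[] q : small) {
                total += euclideanDistance(p, q);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed, chosen arbitrarily
        double[][] large = randomMatrix(1_000_000, 3, rnd);
        double[][] small = randomMatrix(100, 3, rnd);

        long start = System.nanoTime();
        double total = sumOfDistances(large, small);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        long pairs = (long) large.length * small.length;
        System.out.println(pairs + " pairs, sum " + total + ", " + elapsedMs + " ms");
    }
}
```

Timings from a sketch like this depend heavily on JIT warm-up and on
whether results are stored, so they are only a rough baseline.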

In Flink batch:
1. Read Avro files with 1 million and 100 rows in the same format as above.
2. Perform a cross operation from the 100-row dataset with the 1-million-row
dataset, using the crossWithHuge hint because the broadcast 1 million
dataset is the bigger one in this case.
3. Apply a map function that performs the distance calculation.
4. After the cross, do a count at the end as a closing step.


When I package and submit the jar to a Flink cluster, it takes about 2 min
10 sec to complete. I can see that the 1 million dataset finishes loading
from the Avro file in about a minute, and it is indicated as broadcast,
which makes sense. Since I am running on a single slot, I believe no data
is shipped across the network. I am wondering why it still takes another
70 seconds to run the cross operation. I understand a Cartesian product can
be expensive, but I would expect it to be close to the nested loop in Java
for this case. Please advise.

Thanks for your help in advance!

Regards,
Varun

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Latency with cross operation on Datasets

Fabian Hueske-2
Hi Varun,

The focus of the DataSet execution is on robustness. The smaller DataSet is stored serialized in memory.
Also, most of the communication happens via serialization (instead of passing object references).
Serialization should therefore add significant overhead compared to a thread-local execution.
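
To make the serialization point concrete, here is a rough plain-Java sketch
that runs the same pairwise distance loop twice: once with the smaller side
held as live objects, and once with it held as byte arrays that are decoded
for every pair. It does not use Flink's serializers or runtime. The
ByteBuffer encoding, the per-pair decoding, the reduced input size, and all
names (SerializedCrossSketch, serialize, deserialize, crossByReference,
crossViaSerialization) are assumptions made purely for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Random;

public class SerializedCrossSketch {

    // Encode a point as [length][v0][v1]... (stand-in for a real serializer).
    static byte[] serialize(double[] point) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Double.BYTES * point.length);
        buf.putInt(point.length);
        for (double v : point) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    // Decode a point written by serialize().
    static double[] deserialize(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        double[] point = new double[buf.getInt()];
        for (int i = 0; i < point.length; i++) {
            point[i] = buf.getDouble();
        }
        return point;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Smaller side held as live objects: pairs are formed by reference.
    static double crossByReference(double[][] outer, double[][] inner) {
        double total = 0.0;
        for (double[] p : outer) {
            for (double[] q : inner) {
                total += distance(p, q);
            }
        }
        return total;
    }

    // Smaller side held as bytes: every pair pays for one decode (assumption).
    static double crossViaSerialization(double[][] outer, byte[][] innerBytes) {
        double total = 0.0;
        for (double[] p : outer) {
            for (byte[] qBytes : innerBytes) {
                total += distance(p, deserialize(qBytes));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7); // arbitrary seed
        double[][] outer = new double[100_000][3]; // smaller than the 1 million in the thread
        double[][] inner = new double[100][3];
        for (double[] row : outer) for (int c = 0; c < 3; c++) row[c] = rnd.nextDouble();
        for (double[] row : inner) for (int c = 0; c < 3; c++) row[c] = rnd.nextDouble();

        byte[][] innerBytes = new byte[inner.length][];
        for (int i = 0; i < inner.length; i++) innerBytes[i] = serialize(inner[i]);

        long t0 = System.nanoTime();
        double byRef = crossByReference(outer, inner);
        long t1 = System.nanoTime();
        double viaBytes = crossViaSerialization(outer, innerBytes);
        long t2 = System.nanoTime();

        System.out.println("by reference:      " + (t1 - t0) / 1_000_000 + " ms, sum " + byRef);
        System.out.println("via serialization: " + (t2 - t1) / 1_000_000 + " ms, sum " + viaBytes);
    }
}
```

Both variants produce the same sum; only the time spent per pair differs,
and how large the gap is will vary with the JVM and the encoding used.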

Best, Fabian

In reply to Varun Dhore's message above, sent 2018-05-11 3:16 GMT+02:00.