(DEPRECATED) Apache Flink User Mailing List archive.

Latency with cross operation on Datasets

Posted by Varun Dhore on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Latency-with-cross-operation-on-Datasets-tp20077.html

Hello flink community,

I am trying to understand the latency involved in the cross operation. Below
are my tests.

In plain Java:
1. Create 2D array 1, populated with 1 million rows and 3 columns of
randomly generated double values.
2. Create 2D array 2, populated with 100 rows and 3 columns of randomly
generated double values.
3. Run a nested for loop over 1 million x 100 iterations and perform a
EuclideanDistance calculation inside the inner loop.
4. Collect the output in a List of doubles and print the size of the list at
the end.

The above steps complete in about 15 seconds in plain Java on my laptop.
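
The plain Java steps above could be sketched roughly as follows. This is only an illustration of the described test, not the original code: the class name CrossLoopSketch, the helper names, the fixed seeds, and the command-line arguments are all assumptions. The defaults are deliberately smaller than the sizes in the post, because holding 100 million boxed Double values in a List will likely need a few GB of heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; all names here are hypothetical.
class CrossLoopSketch {

    // Builds a rows x cols array filled with random doubles in [0, 1).
    static double[][] randomRows(int rows, int cols, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[rows][cols];
        for (double[] row : data) {
            for (int c = 0; c < cols; c++) {
                row[c] = rnd.nextDouble();
            }
        }
        return data;
    }

    // Euclidean distance between two points of equal dimension.
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Nested loop over every (big row, small row) pair, i.e. the
    // Cartesian product, collecting one distance per pair.
    static List<Double> crossDistances(double[][] big, double[][] small) {
        List<Double> out = new ArrayList<>();
        for (double[] x : big) {
            for (double[] y : small) {
                out.add(euclideanDistance(x, y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed defaults; pass "1000000 100" to match the sizes in the post.
        int bigRows = args.length > 0 ? Integer.parseInt(args[0]) : 10_000;
        int smallRows = args.length > 1 ? Integer.parseInt(args[1]) : 100;

        double[][] big = randomRows(bigRows, 3, 1L);
        double[][] small = randomRows(smallRows, 3, 2L);

        long start = System.nanoTime();
        List<Double> distances = crossDistances(big, small);
        long millis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("pairs: " + distances.size() + ", elapsed ms: " + millis);
    }
}
```

Note that a loop like this works on primitive arrays that are already in memory, so its timing does not include reading or deserializing any input.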

In Flink batch:
1. Read Avro files with 1 million and 100 rows, in the same format as above.
2. Perform a cross operation from the 100-row dataset with the 1 million-row
dataset, using the crossWithHuge hint because the broadcast 1 million-row
dataset is the bigger one in this case.
3. Apply a map function that performs the distance calculation.
4. After the cross, do a count at the end as a final step.

When I package and submit the jar to a Flink cluster, it takes about 2 min
10 sec to complete. I can see that the 1 million-row dataset finishes loading
from the Avro file in a minute, and it is indicated as broadcast, which makes
sense. Since I am running it on a single slot, I believe no data is shipped
across the network. I am wondering why it still takes another 70 seconds to
run the cross operation. I understand that a Cartesian product can be
expensive, but I expected it to be close to the nested loop in Java for this
case. Please advise.

Thanks for your help in advance!

Regards,
Varun
