Hello flink community,
I am trying to understand the latency involved in the cross operation. Below are my tests.

In plain Java:
1. Create 2D array 1 - populated with 1 million rows and 3 columns with randomly generated double values
2. Create 2D array 2 - populated with 100 rows and 3 columns with randomly generated double values
3. Run a nested for loop 1 million X 100 times and perform the EuclideanDistance calculation inside the nested loop
4. Collect the output in a List of doubles and print the size of the list at the end.

The above steps complete in about 15 seconds in plain Java on my laptop.

In Flink batch:
1. Read avro files with 1 million and 100 rows in the same format as above
2. Perform a cross operation from the 100 row dataset with the 1 million row dataset, using the crossWithHuge hint because the broadcast 1 million row dataset is the bigger one in this case.
3. Apply a map function that performs the distance function.
4. After the cross I do a count at the end as a closing step.

When I package the jar and submit it to the Flink cluster, it takes about 2 min and 10 sec to complete. I can see that the 1 million row dataset finishes loading from the avro file in a minute, and it is indicated as broadcast, which makes sense. Since I am running it on a single slot, I believe there is no data shipped across the network. I am wondering why it still takes another 70 seconds to run the cross operation. I understand that a Cartesian product can be expensive, but I am guessing it should be close to the nested loop in Java for this case.

Please advise. Thanks for your help in advance!

Regards,
Varun

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
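For reference, the plain Java test described above can be sketched roughly as follows. This is a minimal reconstruction, not the poster's actual code: the class name `CrossBaseline`, the helper names, the fixed seeds, and the reduced default size are all assumptions made for illustration. At the full 1 million X 100 size the list holds 100 million boxed `Double` values, so a run at that size will likely need a heap of several gigabytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical reconstruction of the plain Java baseline from the post.
public class CrossBaseline {

    // Build a rows x cols array of random doubles (seeded so runs are repeatable).
    public static double[][] randomRows(int rows, int cols, long seed) {
        Random rnd = new Random(seed);
        double[][] data = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                data[i][j] = rnd.nextDouble();
            }
        }
        return data;
    }

    // Euclidean distance between two vectors of equal length.
    public static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Nested loop over every (big, small) pair, collecting one distance per pair.
    public static List<Double> crossDistances(double[][] big, double[][] small) {
        List<Double> out = new ArrayList<>(Math.multiplyExact(big.length, small.length));
        for (double[] b : big) {
            for (double[] s : small) {
                out.add(euclideanDistance(b, s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults are smaller than the post's 1,000,000 x 100 (an assumption,
        // chosen so the sketch runs with a default heap); pass sizes as arguments.
        int bigRows = args.length > 0 ? Integer.parseInt(args[0]) : 10_000;
        int smallRows = args.length > 1 ? Integer.parseInt(args[1]) : 100;

        double[][] big = randomRows(bigRows, 3, 1L);
        double[][] small = randomRows(smallRows, 3, 2L);

        long start = System.nanoTime();
        List<Double> out = crossDistances(big, small);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("distances computed: " + out.size());
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

To approximate the sizes from the post, run it as `java -Xmx6g CrossBaseline 1000000 100`; the heap setting is a guess and may need adjusting. Note that this loop works directly on in-memory primitive arrays, so it involves no record serialization or deserialization at all.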
Hi Varun,

The focus of the DataSet execution is on robustness. The smaller DataSet is stored serialized in memory.

2018-05-11 3:16 GMT+02:00 Varun Dhore <[hidden email]>:
Hello flink community,