Thanks for your quick answer.
In the following, I roughly sketch the MassJoin algorithm.
http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
It's an R-S join, which I modified into a self-join.
Given a collection of token sets, MassJoin finds all pairs of sets that are similar with respect to the Jaccard similarity (size of the intersection divided by size of the union).
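For clarity, the similarity measure is simply the following (a minimal Java sketch, not the exact code from my program):

import java.util.HashSet;
import java.util.Set;

public class Jaccard {

    // Jaccard similarity = |intersection| / |union| of two token sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        // |union| = |a| + |b| - |intersection|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}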
First, it computes a global token grouping, i.e., each token is assigned to one of 30 groups such that all groups contain roughly the same number of tokens.
Then, it generates two types of signatures for each input set.
If two sets are similar, they must share a common signature.
In the next step, we find all candidate pairs (pairs that share a common signature).
Some candidate pairs are filtered out using the global token grouping.
The remaining candidate pairs are verified to filter out all dissimilar pairs (a rough sketch of the candidate generation and verification steps follows below).
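To make the candidate generation and verification steps concrete, here is a rough single-machine Java sketch. It reuses the jaccard helper above, substitutes a plain prefix filter (under a lexicographic token order) for the paper's two signature types, and omits the grouping-based filter, so it only illustrates the overall flow, not my actual signatures:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SignatureJoinSketch {

    // Stand-in signature scheme: a plain prefix filter under a global
    // (here: lexicographic) token order. My program uses the two signature
    // types from the MassJoin paper instead.
    static List<String> signaturesOf(Set<String> record, double threshold) {
        List<String> sorted = new ArrayList<String>(new TreeSet<String>(record));
        int prefixLen = sorted.size() - (int) Math.ceil(threshold * sorted.size()) + 1;
        return sorted.subList(0, Math.min(Math.max(prefixLen, 0), sorted.size()));
    }

    // Candidate generation via an inverted index on signatures, followed by
    // verification with the exact Jaccard similarity (see the sketch above).
    static Set<String> selfJoin(Map<Long, Set<String>> records, double threshold) {
        Map<String, List<Long>> index = new HashMap<String, List<Long>>();
        for (Map.Entry<Long, Set<String>> e : records.entrySet()) {
            for (String sig : signaturesOf(e.getValue(), threshold)) {
                List<Long> bucket = index.get(sig);
                if (bucket == null) {
                    bucket = new ArrayList<Long>();
                    index.put(sig, bucket);
                }
                bucket.add(e.getKey());
            }
        }
        Set<String> similarPairs = new HashSet<String>();
        for (List<Long> bucket : index.values()) {
            for (int i = 0; i < bucket.size(); i++) {
                for (int j = i + 1; j < bucket.size(); j++) {
                    long a = Math.min(bucket.get(i), bucket.get(j));
                    long b = Math.max(bucket.get(i), bucket.get(j));
                    // verification step: keep only truly similar pairs
                    if (Jaccard.jaccard(records.get(a), records.get(b)) >= threshold) {
                        similarPairs.add(a + "," + b);
                    }
                }
            }
        }
        return similarPairs;
    }
}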
@Fabian
I specified the DOP via the command-line client as follows:
/home/hoenicke/flink-0.6-incubating/bin/flink run -p 11 /home/hoenicke/flink-0.6-incubating/jar/mass6.jar 0.9 file:///home/hoenicke/flink-0.6-incubating/input/inputNummeriert.txt file:///home/hoenicke/flink-0.6-incubating/output -v
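For completeness: I did not set the DOP in the program itself. A minimal sketch of that alternative (assuming the 0.6 Java API method is still called setDegreeOfParallelism) would look like this:

import org.apache.flink.api.java.ExecutionEnvironment;

public class SetDopExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // would override the default DOP for this job
        env.setDegreeOfParallelism(11);
        // ... build the MassJoin data flow here ...
        env.execute("MassJoin");
    }
}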
The log file is attached.
Best, Florian
On 24.09.2014 at 22:45, Fabian Hueske wrote:
Hi,
how did you specify the degree of parallelism (DOP) for your program? Via the command-line client, the system configuration, or otherwise?
The JobManager log file (./log/*jobManager*.log) contains the DOP of each task.
Best, Fabian
2014-09-24 18:41 GMT+02:00 Stephan Ewen <[hidden email]>:
Hi!
Ad hoc, that is not easy to say. It depends on your algorithm and how much data replication it does...
We'd need a bit of time to look into the code. It would help if you could roughly sketch the algorithm for us and give us a breakdown of how much time is spent in which operator (like a screenshot of the runtime web monitor).
Greetings,
Stephan
On Wed, Sep 24, 2014 at 6:18 PM, Florian Hönicke <[hidden email]> wrote:
Hello :)
my Flink program is extremely slow.
I implemented a set similarity join in Flink (Mass-Join).
Furthermore, I implemented a local version in Java.
I compared both implementations.
The local version needs one minute to process a 500 MB dataset.
My Flink program needs 5 minutes (cluster: 11 nodes, 20,000 MB RAM).
I use the Flink version 0.6.
What could be the cause?
I would welcome your response,
Florian Hönicke