Posted by Rockstar Flo
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/long-runtime-tp104p108.html
Thanks for your quick answer.
In the following, I roughly sketch the mass-join algorithm.
http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
It's an R-S join, which I modified into a self-join.
Given a collection of token sets, MassJoin finds all pairs of similar sets
(with respect to Jaccard similarity, i.e. |intersection| / |union|).
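
To make the similarity measure concrete, here is a minimal Python sketch (not code from the actual Flink job). The function name `jaccard` and the convention for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


x = {"flink", "join", "set", "token"}
y = {"flink", "join", "set", "group"}
print(jaccard(x, y))  # 3 shared tokens / 5 distinct tokens = 0.6
```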
First, it calculates a global token grouping, i.e., each token is
assigned to one of 30 groups. Each group has almost the same token
count.
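
One way such a grouping could be computed is sketched below in Python. This is an assumption, not the method from the paper or the job: it reads "almost the same token count" as "almost the same total number of token occurrences" and balances the groups greedily. The name `group_tokens` and its parameters are hypothetical.

```python
import heapq
from collections import Counter


def group_tokens(token_sets, num_groups=30):
    """Map each token to a group id so that group loads are roughly equal."""
    # Number of input sets in which each token occurs.
    freq = Counter(tok for s in token_sets for tok in set(s))
    # Min-heap of (current load, group id): the lightest group is always on top.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    group_of = {}
    # Place frequent tokens first (ties broken by token for determinism).
    for tok, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        load, g = heapq.heappop(heap)
        group_of[tok] = g
        heapq.heappush(heap, (load + count, g))
    return group_of


sets = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
print(group_tokens(sets, num_groups=2))  # {'a': 0, 'b': 1, 'c': 1, 'd': 0}
```

In this toy input the two groups end up with loads 4 and 3, which is as balanced as the frequencies allow.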
Then, it generates two types of signatures for each input set.
If two sets are similar, they must share a common signature.
In the next step, we find all candidate pairs (pairs which share a
common signature).
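
The candidate step can be sketched in Python as an inverted index from signature to set ids. How the two signature types are generated is not shown here; the input `signatures_by_id` and the function name `candidate_pairs` are hypothetical and stand in for whatever the signature step produces.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures_by_id):
    """Return all id pairs (i, j), i < j, that share at least one signature."""
    index = defaultdict(set)  # signature -> ids of the sets that emit it
    for set_id, sigs in signatures_by_id.items():
        for sig in sigs:
            index[sig].add(set_id)
    pairs = set()  # a set drops pairs found via more than one signature
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


sigs = {1: {"s1", "s2"}, 2: {"s2", "s3"}, 3: {"s4"}, 4: {"s2", "s3"}}
print(sorted(candidate_pairs(sigs)))  # [(1, 2), (1, 4), (2, 4)]
```

Because this is a self-join, each pair is emitted once with the smaller id first.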
Some candidate pairs are filtered using the global token grouping.
The remaining candidate pairs are verified to filter out all
dissimilar pairs.
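
The last two steps could look roughly like the Python sketch below. The filter shown is an assumption about how the grouping can be used, not necessarily the filter from the paper: two sets can only share tokens within the same group, so the per-group counts give an upper bound on the intersection size. All names (`group_profile`, `passes_group_filter`, `verify`, `threshold`) are hypothetical, and `group_of` is a token-to-group mapping like the global grouping described above.

```python
from collections import Counter


def group_profile(tokens, group_of):
    """Count how many tokens of one set fall into each group."""
    return Counter(group_of[t] for t in set(tokens))


def passes_group_filter(a, b, group_of, threshold):
    """Cheap test: can a and b still reach the Jaccard threshold?"""
    pa, pb = group_profile(a, group_of), group_profile(b, group_of)
    # Shared tokens lie in the same group, so this sum bounds |a ∩ b| from above.
    upper = sum(min(n, pb[g]) for g, n in pa.items())
    union_lower = len(set(a)) + len(set(b)) - upper
    return union_lower == 0 or upper / union_lower >= threshold


def verify(a, b, threshold):
    """Exact test on the real intersection and union."""
    a, b = set(a), set(b)
    union = a | b
    return not union or len(a & b) / len(union) >= threshold


group_of = {"a": 0, "b": 0, "c": 1, "d": 1}
x, y, z = {"a", "b", "c"}, {"a", "b", "d"}, {"c", "d"}
print(passes_group_filter(x, z, group_of, 0.5))  # False: pruned without comparing tokens
print(passes_group_filter(x, y, group_of, 0.5), verify(x, y, 0.5))  # True True
```

The filter never discards a truly similar pair, but it can let dissimilar pairs through (for example {"a", "c"} and {"b", "d"} with the grouping above), which is why the exact verification is still needed afterwards.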
@Fabian
I specified the DOP via the command-line client as follows:
/home/hoenicke/flink-0.6-incubating/bin/flink run -p 11 \
  /home/hoenicke/flink-0.6-incubating/jar/mass6.jar 0.9 \
  file:///home/hoenicke/flink-0.6-incubating/input/inputNummeriert.txt \
  file:///home/hoenicke/flink-0.6-incubating/output -v
The log file is attached.
Best, Florian
On 24.09.2014 at 22:45, Fabian Hueske wrote:
Hi,
how did you specify the degree of parallelism (DOP) for your
program?
Via the command-line client, the system configuration, or
otherwise?
The JobManager log file (./log/*jobManager*.log) tells
you the DOP of each task.
Best, Fabian