Benchmarking Apache Flink via Query Plan


Benchmarking Apache Flink via Query Plan

giacomo90@libero.it
Plus, I'm currently using 1.1.2 and I cannot change the version due to dependency
problems.
Thanks in advance,

     Giacomo90

>----Original message----
>From: "[hidden email]" <[hidden email]>
>Date: 21/04/2017 17.42
>To: <[hidden email]>
>Subj: Re: WELCOME to [hidden email]
>
>Dear Users and Apache Flink devs,
>
>         For each of my distributed computations, I'm generating and
>reading the JSON files produced by getExecutionPlan() in order to motivate
>my benchmarks. Is there some guide providing an explanation of the exact
>meaning of the fields of the generated JSON file? I'm trying to determine
>from the timing results which part of the computation time was spent sending
>messages and which was spent on I/O or CPU operations.
>         By the way, I also noticed that I do not get any information
>concerning the actual data that is being used and transmitted over the
>network (the actual data size and the messages' data size).
>         Moreover, I'm currently using the following way to get the JSON file:
>
>> createAndRegisterDataSinks();
>> String plan = globalEnvironment.getExecutionPlan();
>> createAndRegisterDataSinks();
>> globalEnvironment.execute(getClass().getSimpleName()); // Running the actual class
>
>          Is there a better way to do it?
>          Thanks in advance for your support,
>
>    Giacomo90
>
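As a side note, one way to keep each run's plan around for later comparison is to write it to a file. Below is a minimal, stdlib-only sketch of that idea; the `dumpPlan` helper is hypothetical, and the Flink calls appear only in the comment (in 1.1.x, getExecutionPlan() clears the registered sinks, which is why the snippet in the message re-registers them before execute()):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PlanDump {

    // Writes a plan JSON string (as returned by env.getExecutionPlan())
    // to <jobName>-plan.json so every benchmark run keeps its own plan.
    public static Path dumpPlan(String planJson, String jobName) throws IOException {
        Path out = Paths.get(jobName + "-plan.json");
        return Files.write(out, planJson.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // In a real job this string would come from:
        //   String plan = globalEnvironment.getExecutionPlan();
        //   createAndRegisterDataSinks(); // re-register: getExecutionPlan() cleared them
        Path p = dumpPlan("{\"nodes\":[]}", "MyBenchmark");
        System.out.println(Files.readAllLines(p).get(0)); // prints {"nodes":[]}
    }
}
```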



Re: Benchmarking Apache Flink via Query Plan

Fabian Hueske-2
Hi Giacomo90,

I'm not aware of a detailed description of the execution plan.
The plan can be used to identify the execution strategies (shipping and local) chosen by the optimizer and some properties of the data (partitioning, order).
Common shipping strategies are FORWARD (local forwarding, no network transfer) and HASH_PARTITION (shuffling by key).
Common local strategies are SORT (sorts the data set), HASH_FIRST_BUILD (builds a hash table from the first input and probes it with the second input), and SORT_MERGE (sort-merge join, requires both inputs to be sorted). There are a few more strategies.
Note that operators in the plan can be chained together when the program is executed, and chained operators appear as a single node.
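To tabulate these strategies across benchmark runs, the strategy fields can be pulled out of the plan JSON. A rough JDK-only sketch follows; the field names "ship_strategy" and "local_strategy" are assumptions about the plan JSON format, so check the actual field names in your 1.1.2 output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlanStrategies {

    // Collects every value of the given field (e.g. "ship_strategy" or
    // "local_strategy") from an execution-plan JSON string.
    public static List<String> strategyValues(String planJson, String field) {
        List<String> values = new ArrayList<>();
        Pattern p = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(planJson);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String sample = "{\"nodes\":[{\"id\":2,\"predecessors\":"
                + "[{\"id\":1,\"ship_strategy\":\"Hash Partition on [0]\"}]}]}";
        System.out.println(strategyValues(sample, "ship_strategy"));
        // prints [Hash Partition on [0]]
    }
}
```

A regex keeps the sketch dependency-free; with a JSON library on the classpath, walking the "nodes" array properly would be more robust.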

The plan also does not contain any details about the data sizes (any numbers you see there are mostly inaccurate estimates).
The web dashboard shows some metrics on the processed data volumes.

Btw., you can visualize the plan JSON with this online tool [1].

Best, Fabian

[1] http://flink.apache.org/visualizer/
