Collecting operators' real output cardinalities as JSON files

Collecting operators' real output cardinalities as JSON files

Francesco Ventura
Hi everybody,

I would like to collect the statistics and the real output cardinalities about the execution of many jobs as JSON files. I know there is a REST interface that can be used, but I was looking for something simpler. In practice, I would like to get the information shown in the WebUI at runtime about a job and store it as a file. I am using env.getExecutionPlan() to get the execution plan of a job with the estimated cardinalities for each operator. However, it includes only the estimated cardinalities, and it can only be used before calling env.execute().
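
For reference, a minimal sketch of how I dump that estimated plan with env.getExecutionPlan() (the small DataSet job is purely illustrative):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpEstimatedPlan {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Illustrative pipeline; the real job goes here.
        DataSet<Integer> doubled = env.fromElements(1, 2, 3)
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * 2;
                    }
                });
        doubled.output(new DiscardingOutputFormat<>());

        // getExecutionPlan() returns the plan as a JSON string, but it contains
        // only the optimizer's estimated cardinalities and has to be called
        // before execute().
        String planJson = env.getExecutionPlan();
        Files.write(Paths.get("plan.json"), planJson.getBytes());

        env.execute("dump-estimated-plan");
    }
}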

Is there a similar way to extract the real output cardinalities of each pipeline after execution?
Is there a place where the Flink cluster stores the history of information about executed jobs?
Is developing a REST client to extract such information the only possible way?

I would also like to avoid adding counters to the job source code, since I am monitoring the runtime execution and should avoid anything that could interfere with it.

Maybe it is a trivial problem, but I had a quick look around and could not find a solution.

Thank you very much,

Francesco

Re: Collecting operators' real output cardinalities as JSON files

Piotr Nowojski-3
Hi Francesco,

Have you taken a look at the metrics [1], and the IO metrics [2] in particular? You can use one of the pre-existing metric reporters [3] or implement a custom one. You could export the metrics to some third-party system and get the JSON from there, or export them to JSON directly via a custom metric reporter.
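
For example, a custom reporter could collect the numRecordsOut counters of the operators and periodically append them to a JSON file. A minimal sketch (the "path" option and the output layout are made up; the MetricReporter/Scheduled interfaces are the regular Flink ones):

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class JsonFileReporter implements MetricReporter, Scheduled {

    private final Map<String, Counter> counters = new ConcurrentHashMap<>();
    private String path;

    @Override
    public void open(MetricConfig config) {
        // "path" is an illustrative reporter option, set in flink-conf.yaml.
        path = config.getString("path", "/tmp/flink-metrics.json");
    }

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        // Keep only the operator output-record counters.
        if (metric instanceof Counter && "numRecordsOut".equals(metricName)) {
            counters.put(group.getMetricIdentifier(metricName), (Counter) metric);
        }
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        counters.remove(group.getMetricIdentifier(metricName));
    }

    @Override
    public void report() {
        // Called at the configured reporter interval; append one JSON object per call.
        String json = counters.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\": " + e.getValue().getCount())
                .collect(Collectors.joining(", ", "{", "}"));
        try {
            Files.write(Paths.get(path), (json + System.lineSeparator()).getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // Ignored in this sketch.
        }
    }

    @Override
    public void close() {
    }
}

Such a reporter is then registered in flink-conf.yaml (metrics.reporter.<name>.class plus its options and reporting interval).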

Piotrek



Re: Collecting operators' real output cardinalities as JSON files

Francesco Ventura
Hi Piotrek,

Thank you for your reply and for your suggestions. Just one more doubt:
Will using a metrics reporter and custom metrics affect the performance of the running jobs in terms of execution time? Since I need the exact netRunTime of each job, would it be more reliable to use the REST APIs to get the other information?
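
(For reference, the netRunTime I mean is the one reported with the job result, so it needs no extra instrumentation in the job code; a minimal sketch with a purely illustrative job:)

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

public class NetRuntimeExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).output(new DiscardingOutputFormat<>());

        // execute() returns a JobExecutionResult; getNetRuntime() is the job's
        // net execution time in milliseconds.
        JobExecutionResult result = env.execute("net-runtime-example");
        System.out.println("netRunTime (ms): " + result.getNetRuntime());
    }
}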

Thank you. Best,

Francesco


Re: Collecting operators' real output cardinalities as JSON files

Piotr Nowojski-3
Hi Francesco,

As long as you do not set the metric reporter's update interval to some very low value, there should be no visible performance degradation.

Maybe worth keeping in mind: if your jobs are bounded (they work on bounded input and finish/complete at some point in time), the last metric value reported before the job completes might not necessarily reflect the end state of the job. This limitation may not apply if you use the REST API, as the JobManager may still remember the values you are looking for.
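
For completeness, a minimal sketch of such a REST client, assuming the default JobManager REST endpoint on port 8081; the JSON returned for a job contains, among other things, its vertices with their aggregated I/O metrics and the job duration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

public class JobStatsDump {
    public static void main(String[] args) throws Exception {
        String jobManager = "http://localhost:8081"; // assumed JobManager REST address
        String jobId = args[0];                      // id of the (finished) job

        // GET /jobs/<jobid> describes the job, including per-vertex metrics.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(jobManager + "/jobs/" + jobId))
                .GET()
                .build();
        String json = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();

        // Store the raw JSON for offline post-processing.
        Files.write(Paths.get(jobId + ".json"), json.getBytes());
    }
}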

Piotrek


Re: Collecting operators' real output cardinalities as JSON files

Francesco Ventura
Thank you very much for your explanation.
I will keep it in mind.

Best,

Francesco
