REST: reading completed jobs' details

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

REST: reading completed jobs' details

Miguel Coimbra
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,


Reply | Threaded
Open this post in threaded view
|

Re: REST: reading completed jobs' details

Chesnay Schepler
When you create an environment that way, then the cluster is shutdown once the job completes.
The WebUI can _appear_ as still working since all the files, and data about the job, is cached in the browser.

On 05.09.2018 17:39, Miguel Coimbra wrote:
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,



Reply | Threaded
Open this post in threaded view
|

Re: REST: reading completed jobs' details

Miguel Coimbra
Thanks for the reply.

However, I think my case differs because I am running a sequence of independent Flink jobs on the same environment instance.
I only create the LocalExecutionEnvironment once.

The web manager shows the job ID changing correctly every time a new job is executed.

Since it is the same execution environment (and therefore the same cluster instance I imagine), those completed jobs should show as well, no?

On Wed, 5 Sep 2018 at 18:40, Chesnay Schepler <[hidden email]> wrote:
When you create an environment that way, then the cluster is shutdown once the job completes.
The WebUI can _appear_ as still working since all the files, and data about the job, is cached in the browser.

On 05.09.2018 17:39, Miguel Coimbra wrote:
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,



Reply | Threaded
Open this post in threaded view
|

Re: REST: reading completed jobs' details

Chesnay Schepler
No, the cluster isn't shared. For each job a separate cluster is spun up when calling execute(), at the end of which it is shut down.

For explicitly creation and shutdown of a cluster I would suggest to execute your jobs as a test that contains a MiniClusterResource.

On 05.09.2018 20:59, Miguel Coimbra wrote:
Thanks for the reply.

However, I think my case differs because I am running a sequence of independent Flink jobs on the same environment instance.
I only create the LocalExecutionEnvironment once.

The web manager shows the job ID changing correctly every time a new job is executed.

Since it is the same execution environment (and therefore the same cluster instance I imagine), those completed jobs should show as well, no?

On Wed, 5 Sep 2018 at 18:40, Chesnay Schepler <[hidden email]> wrote:
When you create an environment that way, then the cluster is shutdown once the job completes.
The WebUI can _appear_ as still working since all the files, and data about the job, is cached in the browser.

On 05.09.2018 17:39, Miguel Coimbra wrote:
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,




Reply | Threaded
Open this post in threaded view
|

Re: REST: reading completed jobs' details

Miguel Coimbra
Hello Chesnay,

Thanks for the information.

Decided to move straight away to launching a standalone cluster.
I'm now having another problem when trying to submit a job through my Java program after launching the standalone cluster.

I configured the cluster (flink-1.6.0/conf/flink-conf.yaml) to use 2 TaskManager instances and assigned port ranges for most Flink cluster entities (to avoid port collisions with more than 1 TaskManager):

query.server.ports: 30000-35000
query.proxy.ports: 35001-40000
taskmanager.rpc.port: 45001-50000
taskmanager.data.port: 50001-55000
blob.server.port: 55001-60000


I'm launching in Linux with:

./start-cluster.sh

Starting cluster.
Starting standalonesession daemon on host xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.
[INFO] 1 instance(s) of taskexecutor are already running on xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.


However, my Java program ends up hanging as soon as I perform an execute() call (for example by calling count() on a DataSet).

Checking the JobManager log, I find the following exception whenever my Java program calls execute() over the ExecutionEnvironment (either using Maven on the terminal or from IntelliJ IDEA):

WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [/127.0.0.1:47774] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 1347375960 - discarded

I checked that the problem is happening on a count(), so I don't think it has to do with the JobManager/TaskManagers trying to exchange excessively-big messages.

While searching, I tried to make sure my program compiles with the same library versions as those in this cluster version of Flink.


I downloaded the Apache Flink 1.6 binaries to launch the cluster:




I then checked the library versions used in the pom.xml of the 1.6.0 branch of the Flink repository:



On my project's pom.xml, I have the following:

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<flink.version>1.6.0</flink.version>
<slf4j.version>1.7.7</slf4j.version>
<log4j.version>1.2.17</log4j.version>
<scala.version>2.11.12</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<akka.version>2.4.20</akka.version>
<junit.version>4.12</junit.version>
<junit.jupiter.version>5.0.0</junit.jupiter.version>
<junit.vintage.version>${junit.version}.1</junit.vintage.version>
<junit.platform.version>1.0.1</junit.platform.version>
<aspectj.version>1.9.1</aspectj.version>
</properties>

My project's dependency versions match those of the Flink 1.6 repository (for libraries such as akka).
However, I'm having difficulty understanding what else may be causing this problem.

Thanks for your attention.

Best,

On Wed, 5 Sep 2018 at 20:18, Chesnay Schepler <[hidden email]> wrote:
No, the cluster isn't shared. For each job a separate cluster is spun up when calling execute(), at the end of which it is shut down.

For explicitly creation and shutdown of a cluster I would suggest to execute your jobs as a test that contains a MiniClusterResource.

On 05.09.2018 20:59, Miguel Coimbra wrote:
Thanks for the reply.

However, I think my case differs because I am running a sequence of independent Flink jobs on the same environment instance.
I only create the LocalExecutionEnvironment once.

The web manager shows the job ID changing correctly every time a new job is executed.

Since it is the same execution environment (and therefore the same cluster instance I imagine), those completed jobs should show as well, no?

On Wed, 5 Sep 2018 at 18:40, Chesnay Schepler <[hidden email]> wrote:
When you create an environment that way, then the cluster is shutdown once the job completes.
The WebUI can _appear_ as still working since all the files, and data about the job, is cached in the browser.

On 05.09.2018 17:39, Miguel Coimbra wrote:
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,




Reply | Threaded
Open this post in threaded view
|

Re: REST: reading completed jobs' details

Chesnay Schepler
Did you by chance use the RemoteEnvironment and pass in 6123 as the port? If so, try using 8081 instead, which is the REST port.

On 06.09.2018 18:24, Miguel Coimbra wrote:
Hello Chesnay,

Thanks for the information.

Decided to move straight away to launching a standalone cluster.
I'm now having another problem when trying to submit a job through my Java program after launching the standalone cluster.

I configured the cluster (flink-1.6.0/conf/flink-conf.yaml) to use 2 TaskManager instances and assigned port ranges for most Flink cluster entities (to avoid port collisions with more than 1 TaskManager):

query.server.ports: 30000-35000
query.proxy.ports: 35001-40000
taskmanager.rpc.port: 45001-50000
taskmanager.data.port: 50001-55000
blob.server.port: 55001-60000


I'm launching in Linux with:

./start-cluster.sh

Starting cluster.
Starting standalonesession daemon on host xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.
[INFO] 1 instance(s) of taskexecutor are already running on xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.


However, my Java program ends up hanging as soon as I perform an execute() call (for example by calling count() on a DataSet).

Checking the JobManager log, I find the following exception whenever my Java program calls execute() over the ExecutionEnvironment (either using Maven on the terminal or from IntelliJ IDEA):

WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [/127.0.0.1:47774] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 1347375960 - discarded

I checked that the problem is happening on a count(), so I don't think it has to do with the JobManager/TaskManagers trying to exchange excessively-big messages.

While searching, I tried to make sure my program compiles with the same library versions as those in this cluster version of Flink.


I downloaded the Apache Flink 1.6 binaries to launch the cluster:




I then checked the library versions used in the pom.xml of the 1.6.0 branch of the Flink repository:



On my project's pom.xml, I have the following:

<properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
   <maven.compiler.source>1.8</maven.compiler.source>
   <maven.compiler.target>1.8</maven.compiler.target>
   <flink.version>1.6.0</flink.version>
   <slf4j.version>1.7.7</slf4j.version>
   <log4j.version>1.2.17</log4j.version>
   <scala.version>2.11.12</scala.version>
   <scala.binary.version>2.11</scala.binary.version>
   <akka.version>2.4.20</akka.version>
   <junit.version>4.12</junit.version>
   <junit.jupiter.version>5.0.0</junit.jupiter.version>
   <junit.vintage.version>${junit.version}.1</junit.vintage.version>
   <junit.platform.version>1.0.1</junit.platform.version>
   <aspectj.version>1.9.1</aspectj.version>
</properties>

My project's dependency versions match those of the Flink 1.6 repository (for libraries such as akka).
However, I'm having difficulty understanding what else may be causing this problem.

Thanks for your attention.

Best,

On Wed, 5 Sep 2018 at 20:18, Chesnay Schepler <[hidden email]> wrote:
No, the cluster isn't shared. For each job a separate cluster is spun up when calling execute(), at the end of which it is shut down.

For explicitly creation and shutdown of a cluster I would suggest to execute your jobs as a test that contains a MiniClusterResource.

On 05.09.2018 20:59, Miguel Coimbra wrote:
Thanks for the reply.

However, I think my case differs because I am running a sequence of independent Flink jobs on the same environment instance.
I only create the LocalExecutionEnvironment once.

The web manager shows the job ID changing correctly every time a new job is executed.

Since it is the same execution environment (and therefore the same cluster instance I imagine), those completed jobs should show as well, no?

On Wed, 5 Sep 2018 at 18:40, Chesnay Schepler <[hidden email]> wrote:
When you create an environment that way, then the cluster is shutdown once the job completes.
The WebUI can _appear_ as still working since all the files, and data about the job, is cached in the browser.

On 05.09.2018 17:39, Miguel Coimbra wrote:
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,





Reply | Threaded
Open this post in threaded view
|

Re: REST: reading completed jobs' details

Miguel Coimbra
Exactly, that was the problem.
Didn't realize the restructured cluster channels all communications to the REST port.

Thanks again.

Best,

On Thu, 6 Sep 2018 at 17:57, Chesnay Schepler <[hidden email]> wrote:
Did you by chance use the RemoteEnvironment and pass in 6123 as the port? If so, try using 8081 instead, which is the REST port.

On 06.09.2018 18:24, Miguel Coimbra wrote:
Hello Chesnay,

Thanks for the information.

Decided to move straight away to launching a standalone cluster.
I'm now having another problem when trying to submit a job through my Java program after launching the standalone cluster.

I configured the cluster (flink-1.6.0/conf/flink-conf.yaml) to use 2 TaskManager instances and assigned port ranges for most Flink cluster entities (to avoid port collisions with more than 1 TaskManager):

query.server.ports: 30000-35000
query.proxy.ports: 35001-40000
taskmanager.rpc.port: 45001-50000
taskmanager.data.port: 50001-55000
blob.server.port: 55001-60000


I'm launching in Linux with:

./start-cluster.sh

Starting cluster.
Starting standalonesession daemon on host xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.
[INFO] 1 instance(s) of taskexecutor are already running on xxxxxxx.
Starting taskexecutor daemon on host xxxxxxx.


However, my Java program ends up hanging as soon as I perform an execute() call (for example by calling count() on a DataSet).

Checking the JobManager log, I find the following exception whenever my Java program calls execute() over the ExecutionEnvironment (either using Maven on the terminal or from IntelliJ IDEA):

WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [/127.0.0.1:47774] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 1347375960 - discarded

I checked that the problem is happening on a count(), so I don't think it has to do with the JobManager/TaskManagers trying to exchange excessively-big messages.

While searching, I tried to make sure my program compiles with the same library versions as those in this cluster version of Flink.


I downloaded the Apache Flink 1.6 binaries to launch the cluster:




I then checked the library versions used in the pom.xml of the 1.6.0 branch of the Flink repository:



On my project's pom.xml, I have the following:

<properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
   <maven.compiler.source>1.8</maven.compiler.source>
   <maven.compiler.target>1.8</maven.compiler.target>
   <flink.version>1.6.0</flink.version>
   <slf4j.version>1.7.7</slf4j.version>
   <log4j.version>1.2.17</log4j.version>
   <scala.version>2.11.12</scala.version>
   <scala.binary.version>2.11</scala.binary.version>
   <akka.version>2.4.20</akka.version>
   <junit.version>4.12</junit.version>
   <junit.jupiter.version>5.0.0</junit.jupiter.version>
   <junit.vintage.version>${junit.version}.1</junit.vintage.version>
   <junit.platform.version>1.0.1</junit.platform.version>
   <aspectj.version>1.9.1</aspectj.version>
</properties>

My project's dependency versions match those of the Flink 1.6 repository (for libraries such as akka).
However, I'm having difficulty understanding what else may be causing this problem.

Thanks for your attention.

Best,

On Wed, 5 Sep 2018 at 20:18, Chesnay Schepler <[hidden email]> wrote:
No, the cluster isn't shared. For each job a separate cluster is spun up when calling execute(), at the end of which it is shut down.

For explicitly creation and shutdown of a cluster I would suggest to execute your jobs as a test that contains a MiniClusterResource.

On 05.09.2018 20:59, Miguel Coimbra wrote:
Thanks for the reply.

However, I think my case differs because I am running a sequence of independent Flink jobs on the same environment instance.
I only create the LocalExecutionEnvironment once.

The web manager shows the job ID changing correctly every time a new job is executed.

Since it is the same execution environment (and therefore the same cluster instance I imagine), those completed jobs should show as well, no?

On Wed, 5 Sep 2018 at 18:40, Chesnay Schepler <[hidden email]> wrote:
When you create an environment that way, then the cluster is shutdown once the job completes.
The WebUI can _appear_ as still working since all the files, and data about the job, is cached in the browser.

On 05.09.2018 17:39, Miguel Coimbra wrote:
Hello,

I'm having difficulty reading the status (such as time taken for each dataflow operator in a job) of jobs that have completed.

First, when I click on "Completed jobs" on the web interface (by default at 8081), no job shows up.
I see jobs that exist as "Running", but as soon as they finish, I would expect them to appear in the "Complete jobs" section, but no luck.

Consider that I am running locally (web UI is running, I checked and it is available via browser) on 8081.
None of these links worked for checking jobs that have already finished, such as the job ID 618fac9da6ea458f5091a9c40e54cbcc that had been running:


I'm running with a LocalExecutionEnvironment with with the method:
ExecutionEnvironment.createLocalEnvironmentWithWebUI(conf)
I hope anyone may be able to help.

Best,