Jar Uploads in High Availability (Flink 1.7.2)

Jar Uploads in High Availability (Flink 1.7.2)

Martin, Nick-2

When I upload a jar through the REST API, it looks like only the JobManager that received the upload request is aware of the newly uploaded jar. That worked fine for me in older versions, where all clients were redirected to the leader. Now that each JobManager accepts requests, a jar upload request can end up on any one (and only one) of the JobManagers, not necessarily the leader. Further, each JobManager responds to a GET request on the /jars endpoint with its own local list of jars. If I try to use one of the jar IDs from that response, my next request may not go to the same JobManager (requests go through Docker and are load-balanced), so the jar ID isn’t found on the JobManager handling that request.
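The failure mode described above can be reproduced with a small in-memory simulation. This is only a sketch: the `JobManager` and `RoundRobinBalancer` classes are hypothetical stand-ins, not Flink code, and the jar ID format (a UUID prefix plus the file name) is an assumption made for illustration.

```python
import itertools
import uuid


class JobManager:
    """Hypothetical stand-in for one JobManager with a local upload directory."""

    def __init__(self, name):
        self.name = name
        self.jars = {}  # jar ID -> file name, visible to this instance only

    def upload(self, file_name):
        # Assumed ID format for illustration: "<uuid>_<file name>".
        jar_id = f"{uuid.uuid4()}_{file_name}"
        self.jars[jar_id] = file_name
        return jar_id

    def list_jars(self):
        return sorted(self.jars)

    def run(self, jar_id):
        if jar_id not in self.jars:
            raise LookupError(f"{self.name}: jar {jar_id} not found")
        return f"{self.name}: started job from {self.jars[jar_id]}"


class RoundRobinBalancer:
    """Sends each request to the next backend, ignoring who is the leader."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        return next(self._cycle)


def demo():
    balancer = RoundRobinBalancer([JobManager("jm-1"), JobManager("jm-2")])
    jar_id = balancer.next_backend().upload("job.jar")  # handled by jm-1
    try:
        return balancer.next_backend().run(jar_id)  # handled by jm-2
    except LookupError as err:
        return f"failed: {err}"
```

Calling `demo()` returns a "failed: ..." message, because the second request lands on an instance that never saw the upload.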

Re: Jar Uploads in High Availability (Flink 1.7.2)

Ravi Bhushan Ratnakar
Hi,

I was also experiencing similar behavior. I adopted the following approach:
  • Used a distributed file system (in my case AWS EFS) and set the "web.upload.dir" attribute, so that both JobManagers use the same location.
  • On the load balancer side (AWS ELB), I used a "readiness probe" based on the ZooKeeper entry for the active JobManager address. This way the ELB always points to the active JobManager, and if the active JobManager changes, it automatically points to the new one. Because both use the same location on the distributed file system, the new active JobManager is able to find the same jar.
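The decision such a readiness probe has to make could look roughly like the sketch below. It is not the setup actually used here: the thread does not give the znode path or the format of the ZooKeeper entry, so the lookup is left to a hypothetical `fetch_leader_address` callable that the caller must supply, and `normalize` and `readiness_status` are illustrative names.

```python
from typing import Callable


def normalize(address: str) -> str:
    """Lower-case and drop scheme and trailing slash so addresses compare reliably."""
    address = address.strip().lower()
    for prefix in ("http://", "https://"):
        if address.startswith(prefix):
            address = address[len(prefix):]
    return address.rstrip("/")


def readiness_status(own_address: str, fetch_leader_address: Callable[[], str]) -> int:
    """Return the HTTP status for the probe: 200 only on the active JobManager.

    fetch_leader_address is a hypothetical hook that reads the active
    JobManager address from ZooKeeper; how it does so is an assumption
    left to the caller.
    """
    try:
        leader = fetch_leader_address()
    except Exception:
        return 503  # leader unknown: report "not ready" rather than guess
    return 200 if normalize(leader) == normalize(own_address) else 503
```

With this shape, only the instance whose own address matches the leader entry answers 200, so the load balancer routes traffic to it alone and switches over when the entry changes.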

Regards,
Ravi

In reply to the post by Martin, Nick J [US] (IS) of Wed, Oct 16, 2019 at 1:15 AM

RE: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Martin, Nick-2

Yeah, I’ll do that if I have to. I’m hoping there’s a ‘right’ way to do it that’s easier. If I have to implement the ZooKeeper lookups in my load balancer myself, that feels like a definite step backwards from the pre-1.5 days, when the cluster would give 307 redirects to the current leader.

In reply to the post by Ravi Bhushan Ratnakar, sent Tuesday, October 15, 2019 10:35 PM

Re: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Till Rohrmann
Hi Martin,

Flink's web UI based job submission is currently not well suited to run behind a load balancer. The problem is that web-based job submission is a two-phase operation: uploading the jars and then starting the job. Since Flink's RestServer stores the uploaded files locally, the submission must be executed on the same RestServer to which you uploaded the files. Note, however, that job submission through the CLI client is not affected, since the job graph upload and the submission are one request.
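As a rough illustration of the two phases, the helpers below derive the jar ID from an upload response and build the URL for the follow-up run request. This is a sketch under assumptions: it takes the upload response to be a JSON object with a "filename" field holding the stored path, and the run endpoint to be /jars/<jar ID>/run; both should be checked against the REST API documentation for the Flink version in use. The function names are illustrative.

```python
import posixpath
from urllib.parse import quote


def jar_id_from_upload_response(response: dict) -> str:
    """Phase 1 result: take the last path segment of the stored file as the jar ID.

    Assumes the response carries the server-side path in "filename".
    """
    return posixpath.basename(response["filename"])


def run_url(base_url: str, jar_id: str) -> str:
    """Phase 2 target: the URL to POST to in order to start the uploaded jar."""
    return f"{base_url.rstrip('/')}/jars/{quote(jar_id, safe='')}/run"
```

Both calls only succeed as a pair if `base_url` resolves to the same RestServer each time, or if every RestServer can see the uploaded file.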

A workaround to make the uploads accessible to all RestServers is to configure a DFS for the `web.upload.dir` as Ravi suggested or to use Flink's CLI to submit jobs instead.
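For this workaround to help, every JobManager has to point `web.upload.dir` at the same shared location. A minimal sanity check might look like the sketch below; it assumes the configuration files are flat "key: value" lines, and the helper names and the example path /mnt/efs/flink-uploads are hypothetical.

```python
def read_option(conf_text: str, key: str):
    """Return the value of `key` from flat 'key: value' lines, or None if absent."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line or ":" not in line:
            continue
        name, value = line.split(":", 1)
        if name.strip() == key:
            return value.strip()
    return None


def shared_upload_dir(conf_texts):
    """Return the common web.upload.dir, or None if it is unset or inconsistent."""
    values = {read_option(text, "web.upload.dir") for text in conf_texts}
    if len(values) == 1 and None not in values:
        return values.pop()
    return None
```

Passing the configuration text of each JobManager to `shared_upload_dir` yields the common directory, or None when one instance still uses its default local directory.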

A quick note about the old behaviour with the redirects: they actually defeated the purpose of a load balancer, because all requests were redirected to a single RestServer instance. Hence, running with or without a load balancer should not have made a big difference.

Cheers,
Till

In reply to the post by Martin, Nick J [US] (IS) of Wed, Oct 16, 2019 at 5:58 PM

Re: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

tison
FYI, there is already a corresponding issue: https://issues.apache.org/jira/browse/FLINK-13660

Best,
tison.


In reply to the post by Till Rohrmann of Fri, Oct 18, 2019 at 9:42 PM

RE: EXT :Re: Jar Uploads in High Availability (Flink 1.7.2)

Martin, Nick-2
In reply to this post by Till Rohrmann

So I think what you’re saying is that if I use a DFS for web.upload.dir, my clients can send all their requests to any JobManager instance without caring which one is the leader. That is definitely an improvement, thanks.
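The effect of a shared upload directory can be sketched with two stand-in instances that read and write the same directory. This is a simulation only: the `JobManager` class is hypothetical, a temporary local directory plays the role of the shared file system, and the jar ID "1234_job.jar" is made up.

```python
import os
import tempfile


class JobManager:
    """Hypothetical stand-in for a JobManager whose uploads live in upload_dir."""

    def __init__(self, name, upload_dir):
        self.name = name
        self.upload_dir = upload_dir

    def upload(self, jar_id, payload=b""):
        # Store the jar under its ID in the (possibly shared) directory.
        with open(os.path.join(self.upload_dir, jar_id), "wb") as handle:
            handle.write(payload)
        return jar_id

    def list_jars(self):
        return sorted(os.listdir(self.upload_dir))

    def can_run(self, jar_id):
        return os.path.isfile(os.path.join(self.upload_dir, jar_id))


def demo():
    # The temporary directory stands in for the shared file system mount.
    with tempfile.TemporaryDirectory() as shared_dir:
        jm1 = JobManager("jm-1", shared_dir)
        jm2 = JobManager("jm-2", shared_dir)
        jar_id = jm1.upload("1234_job.jar")
        return jm2.list_jars(), jm2.can_run(jar_id)
```

Here the second instance lists and finds the jar uploaded through the first, which is the behaviour a client behind a load balancer needs.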

 
