Frontend classpath issue

Gyula Fóra-2
Hi,

I have a problem that the frontend somehow seems to have the user jar on the classpath and it leads to a netty conflict:


So in the jobmanager logs I can see that my job started (running on YARN), but I can't access the frontend; it gives an internal server error with the previous exception. So I don't have the same jar problem on the actual running job.

I haven't really seen this before, is this something that happened to somebody else as well?

Thank you!
Gyula

Re: Frontend classpath issue

rmetzger0
Hi,
Since Flink 1.2, "per job yarn applications" (when you do "-m yarn-cluster") include the job jar in the classpath as well.
Does this change explain the behavior?

Re: Frontend classpath issue

Gyula Fóra
Hi Robert,
It definitely explains the behaviour.

This only applies to the frontend, right?
If so, what is the rationale behind it, and how should I handle the dependency conflict?

Thanks,
Gyula

Re: Frontend classpath issue

rmetzger0
Mh. The user jar is put into every classpath. So the jobmanager / taskmanagers are potentially affected by this as well.
Probably the data transfer between the TMs doesn't call the same methods as the UI on the JobManager :)

The simplest solution is to shade the netty in your user jar, i.e. relocate it into a different package.
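
Before relocating anything, it can help to confirm which jar a conflicting class is actually resolved from in a given JVM. The snippet below is only a minimal sketch and is not part of Flink: the `ClassOrigin` class and its `originOf` methods are hypothetical names, and `io.netty.channel.Channel` is used merely as an example default. It relies only on the standard JDK calls `Class.getProtectionDomain()` and `CodeSource.getLocation()`.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not a Flink class): reports where a class was loaded from.
class ClassOrigin {

    /** Returns the jar/directory URL a class was loaded from, or a marker if the JVM does not expose it. */
    static String originOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for classes loaded by the bootstrap class loader.
            return "<bootstrap or unknown>";
        }
        return source.getLocation().toString();
    }

    /** Same as above, but looks the class up by name and reports load failures as text. */
    static String originOf(String className) {
        try {
            return originOf(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not loadable: " + e + ">";
        }
    }

    public static void main(String[] args) {
        // Example class name only; pass the class that appears in your own stack trace.
        String name = args.length > 0 ? args[0] : "io.netty.channel.Channel";
        System.out.println(name + " -> " + originOf(name));
    }
}
```

If the printed location is the user jar rather than a Flink jar, that suggests the user jar's copy is shadowing Flink's, which is the situation that relocating the dependency in the user jar is meant to avoid.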


Re: Frontend classpath issue

Gyula Fóra
Hi Robert,

I was not aware of this big change (I know it's my fault) but I am not sure if I agree with the rationale.

I read through the JIRA and it seems that this is mostly a convenience change, so that we don't need to copy jars and mess with the classloading that much.

On the other hand, if user jars can conflict with frontend/backend classes, that can lead to very serious (and hard to fix) problems, especially in larger-scale deployments.

What do you think about this?

Gyula

Re: Frontend classpath issue

Ufuk Celebi
On Fri, Feb 24, 2017 at 11:05 AM, Gyula Fóra <[hidden email]> wrote:
> I was not aware of this big change (I know it's my fault) but I am not sure
> if I agree with the rationale.

No comment on the actual issue from my side, but I strongly disagree
that this is your fault. We should have covered this better in the
release announcement in my opinion. Of course, this doesn't help now.
;-)

– Ufuk

Re: Frontend classpath issue

rmetzger0
I agree with you, Gyula, this change is dangerous. I have seen another case where a user's job with Hadoop dependencies crashed in Flink 1.2.0 but not in 1.1.x.

I wonder if we should introduce a config flag for Flink 1.2.1 to disable the behavior if needed.


Re: Frontend classpath issue

Aljoscha Krettek
Did any user have problems with the Flink 1.1 behaviour? If not, we could disable it again, by default, and add a flag for adding the user jar to all the classpaths.


Re: Frontend classpath issue

rmetzger0
The JIRA (https://issues.apache.org/jira/browse/FLINK-4913) doesn't mention any particular user or use case.

I honestly don't care so much whether we enable or disable it by default. But since it's the new default behavior of Flink 1.2, I'm against changing that in Flink 1.2.1. That's why I proposed adding a flag to disable it in 1.2.1, so that users upgrading from 1.2.0 to 1.2.1 don't notice a change.


Re: Frontend classpath issue

Gyula Fóra

Hi,
I am wondering whether there is any scenario where the new way makes anything better under normal circumstances.

I can only see how it will break things in subtle ways.

If you think there is any real benefit to the current approach, I don't mind having it as the default; otherwise I am in favor of reverting to the 1.1 default. (My logic is that the user will only observe a difference in behavior when the new setup actually causes problems.)

Gyula



Re: Frontend classpath issue

rmetzger0
I think the change reduces the chance of running into classloading issues in case there's a bug in Flink (i.e. it is using the wrong classloader).

I've filed a JIRA for the problem: https://issues.apache.org/jira/browse/FLINK-6031


Re: Frontend classpath issue

Stephan Ewen
I think we need to get away from dynamic class loading as much as possible. It breaks way too soon and easily causes class leaks.

I would be in favor of understanding how to fix this on the Flink side, i.e., one of:

  - Having flags for disabling it optionally
  - Having an option of "user code first" or "user code last" in the classpath
  - Shading Netty in Flink. I think Netty is a good candidate to be shaded, actually.
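
To make the "user code first" option more concrete, here is a rough sketch of a class loader that tries its own jars before delegating to the parent. This is not Flink's implementation: `ChildFirstClassLoader` is a hypothetical name used for illustration, the `java.` prefix check is a simplifying assumption, and a real version would need a configurable list of packages that must always come from the parent. "User code last" corresponds to the default parent-first delegation of a plain `URLClassLoader`.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: resolves classes from its own URLs before asking the parent.
class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // JDK classes always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // "User code first": look in our own URLs before the parent.
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // Not in the user jars, fall back to normal parent-first delegation.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs, every lookup falls back to the parent loader.
        ClassLoader parent = ChildFirstClassLoader.class.getClassLoader();
        try (ChildFirstClassLoader loader = new ChildFirstClassLoader(new URL[0], parent)) {
            Class<?> c = loader.loadClass("java.util.ArrayList");
            System.out.println(c.getName() + " loaded by " + c.getClassLoader());
        }
    }
}
```

The trade-off is the one discussed in this thread: child-first lets a user jar carry its own Netty without clashing with the framework's copy, but classes shared across the boundary must still come from a single loader, otherwise they fail with `ClassCastException` or `LinkageError`.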


