(DEPRECATED) Apache Flink User Mailing List archive.

Frontend classpath issue

Gyula Fóra-2

Frontend classpath issue

Hi,

I have a problem: the web frontend somehow seems to have the user jar on its classpath, which leads to a Netty conflict:

https://gist.github.com/gyfora/4ec2c8a8a6b33adb80d411460432ce8d

In the JobManager logs I can see that my job started (running on YARN), but I can't access the frontend: it returns an internal server error with the exception above. So the running job itself doesn't seem to hit the same jar problem.

I haven't seen this before. Has this happened to anybody else as well?

Thank you!

Gyula

rmetzger0

Re: Frontend classpath issue

Hi,

Since Flink 1.2, "per-job YARN applications" (when you submit with "-m yarn-cluster") also include the job jar in the classpath.

Does this change explain the behavior?

Gyula Fóra

Re: Frontend classpath issue

Hi Robert,

It definitely explains the behaviour.

This only applies to the frontend, right?

If so, what is the rationale behind it, and how should I handle the dependency conflict?

Thanks,

Gyula

rmetzger0

Re: Frontend classpath issue

Hm. The user jar is put into every classpath, so the JobManager and TaskManagers are potentially affected by this as well.

Probably the data transfer between the TMs doesn't call the same methods as the UI on the JobManager :)

The simplest solution is to shade (relocate) the Netty classes in your user jar into a different package.
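
Before and after shading, it can help to check which jar actually supplies a conflicting class. The sketch below is not from the thread: the class `ClassOrigin`, its helper methods, and the default class name `io.netty.channel.Channel` are illustrative assumptions. It only uses standard JDK calls and reports the location a class was loaded from.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper: reports where a class was loaded from. */
final class ClassOrigin {

    /** Returns the code source location of a class, or a marker for bootstrap classes. */
    static String originOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        return src == null ? "<bootstrap>" : String.valueOf(src.getLocation());
    }

    /** Looks a class up by name without initializing it and returns its origin. */
    static String originOfName(String className) {
        try {
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            return originOf(cls);
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not loadable: " + e + ">";
        }
    }

    public static void main(String[] args) {
        // The default name is only an example; pass the class from your stack trace instead.
        String name = args.length > 0 ? args[0] : "io.netty.channel.Channel";
        System.out.println(name + " -> " + originOfName(name));
    }
}
```

Run with the same classpath as the affected process, this shows whether the class resolves to the user jar or to a Flink jar. After relocating Netty in the user jar, the original class name should no longer resolve to the user jar.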

Gyula Fóra

Re: Frontend classpath issue

Hi Robert,

I was not aware of this big change (I know it's my fault) but I am not sure if I agree with the rationale.

I read through the JIRA and it seems that this is mostly a convenience change, so that we don't need to copy jars and mess with the classloading that much.

On the other hand, if user jars can conflict with frontend/backend classes, that can lead to very serious (and hard to fix) problems, especially in larger-scale deployments.

What do you think about this?

Gyula

Ufuk Celebi

Re: Frontend classpath issue

On Fri, Feb 24, 2017 at 11:05 AM, Gyula Fóra <[hidden email]> wrote:
> I was not aware of this big change (I know it's my fault) but I am not sure
> if I agree with the rationale.

No comment on the actual issue from my side, but I strongly disagree
that this is your fault. We should have covered this better in the
release announcement in my opinion. Of course, this doesn't help now.
;-)

– Ufuk

rmetzger0

Re: Frontend classpath issue

I agree with you, Gyula, this change is dangerous. I have seen another case where a user's job with Hadoop dependencies crashed in Flink 1.2.0 but not in 1.1.x.

I wonder if we should introduce a config flag for Flink 1.2.1 to disable the behavior if needed.

Aljoscha Krettek

Re: Frontend classpath issue

Did any user have problems with the Flink 1.1 behaviour? If not, we could disable it again, by default, and add a flag for adding the user jar to all the classpaths.

rmetzger0

Re: Frontend classpath issue

The JIRA (https://issues.apache.org/jira/browse/FLINK-4913) doesn't mention any particular user or use case.

I honestly don't care so much whether we enable or disable it by default. But since it's the new default behavior of Flink 1.2, I'm against changing that in Flink 1.2.1. That's why I proposed adding a flag to disable it in 1.2.1, so that users upgrading from 1.2.0 to 1.2.1 don't notice a change.

Gyula Fóra

Re: Frontend classpath issue

Hi,
I am wondering whether there is any scenario where the new way makes anything better under normal circumstances.

I can only see how it will break things in subtle ways.

If you think there is any real benefit to the current approach, I don't mind having it as a default; otherwise I am in favor of reverting to the 1.1 default. (My logic is that the user will only observe a difference in behavior when the new setup actually causes problems.)

Gyula

rmetzger0

Re: Frontend classpath issue

I think the change reduces the chance of running into classloading issues in case there's a bug in Flink (i.e., it is using the wrong class loader).

I've filed a JIRA for the problem: https://issues.apache.org/jira/browse/FLINK-6031

Stephan Ewen

Re: Frontend classpath issue

I think we need to get away from the dynamic class loading as much as possible. It breaks way too soon and easily causes class leaks.

I would be in favor of understanding how to fix this on the Flink side, i.e., one of:

- Having flags for disabling it optionally

- Having an option of "user code first" or "user code last" in the classpath

- Shading Netty in Flink. I think Netty is a good candidate to be shaded, actually.
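
To make the "user code first" option concrete, here is a minimal sketch of a class loader that looks in its own jars before delegating to its parent. It is an illustration only, not Flink's implementation: the class name `ChildFirstClassLoader` and the rule that only `java.` classes always come from the parent are assumptions chosen to keep the example short.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;

/** Hypothetical sketch of "user code first" resolution; not taken from Flink. */
class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // JDK classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // User code first: search this loader's own URLs.
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // Not in the user jars: fall back to normal parent-first delegation.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ChildFirstClassLoader <class name> [jar ...]
        if (args.length == 0) {
            System.out.println("usage: ChildFirstClassLoader <class name> [jar ...]");
            return;
        }
        URL[] urls = new URL[args.length - 1];
        for (int i = 1; i < args.length; i++) {
            urls[i - 1] = Paths.get(args[i]).toUri().toURL();
        }
        ClassLoader parent = ChildFirstClassLoader.class.getClassLoader();
        try (ChildFirstClassLoader cl = new ChildFirstClassLoader(urls, parent)) {
            Class<?> c = cl.loadClass(args[0]);
            System.out.println(c.getName() + " defined by " + c.getClassLoader());
        }
    }
}
```

"User code last" is the default parent-first delegation of `URLClassLoader`, so no override is needed for it. The trade-off is that child-first loading lets the user jar's Netty win for user code, while classes loaded by the parent keep using the parent's copy.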
