Flink 1.1.3 OOME Permgen


Flink 1.1.3 OOME Permgen

snntr
Hi everyone,

since upgrading to Flink 1.1.3 we observe frequent OOME permgen TaskManager failures. Monitoring the permgen size on one of the TaskManagers, you can see that each job (new jobs and restarts) adds a few MB, which cannot be collected. Eventually, the OOME happens. This happens with all our jobs, streaming and batch, on YARN 2.4 as well as stand-alone.

On Flink 1.0.2 this was not a problem, but I will investigate further.

The assumption is that Flink somehow holds on to one of the classes that ships with our jar, which prevents garbage collection of the whole classloader. Our jars do not include any Flink dependencies (compileOnly), but of course many others.

Any ideas anyone?

Cheers and thank you,

Konstantin

sent from my phone. Plz excuse brevity and tpyos.
---
Konstantin Knauf *[hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke

Re: Flink 1.1.3 OOME Permgen

Stefan Richter
Hi,

could you somehow provide us with a heap dump from a TM that ran for a while (ideally, taken shortly before an OOME)? That would greatly help us figure out whether a classloader leak is causing the problem.
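If capturing a dump manually at the right moment is difficult, the JVM can be told to write one automatically when the OOME happens. A possible flink-conf.yaml fragment (the dump path is just a placeholder):

```yaml
# Extra JVM options passed to the JobManager/TaskManager processes.
# HeapDumpOnOutOfMemoryError writes a .hprof file at the moment of the OOME.
env.java.opts: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/flink-tm.hprof"
```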

Best,
Stefan



Re: Flink 1.1.3 OOME Permgen

snntr
Hi Stefan,

unfortunately, I cannot share any heap dumps with you. I was able to
resolve some of the issues myself today; the root causes were different
for different jobs.

1) Jackson 2.7.2 (which comes with Flink) has a known class loading
issue (see https://github.com/FasterXML/jackson-databind/issues/1363).
Shipping a shaded version of Jackson 2.8.4 with our user code helped. I
recommend upgrading Flink's Jackson version soon.
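For reference, shading Jackson with the Gradle shadow plugin looks roughly like this (the plugin version and relocation prefix are illustrative, not what we actually used):

```groovy
plugins {
    id 'com.github.johnrengelman.shadow' version '1.2.4'
}

dependencies {
    compile 'com.fasterxml.jackson.core:jackson-databind:2.8.4'
}

shadowJar {
    // Move Jackson into our own namespace so it cannot clash with the
    // Jackson 2.7.2 classes provided by Flink's system classloader.
    relocate 'com.fasterxml.jackson', 'shaded.com.fasterxml.jackson'
}
```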

2) We have a dependency on flink-table [1], which ships with
Calcite including the Calcite JDBC driver, which cannot be collected
because of the known problem with java.sql.DriverManager. Putting
flink-table in Flink's lib dir instead of shipping it with the user code
helps. You should update the documentation, because I think this will
always happen when using flink-table. So I wonder why this has not
come up before, actually.
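The DriverManager problem in a nutshell: DriverManager keeps a static, JVM-global list of registered drivers, and each Driver instance references the classloader that loaded it. A minimal stdlib-only illustration of inspecting that list:

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.Enumeration;

public class RegisteredDrivers {
    public static void main(String[] args) {
        // DriverManager's driver registry is static and JVM-global. A driver
        // loaded by a job's user-code classloader stays registered after the
        // job finishes, so the driver (and through it the whole classloader
        // and all classes it loaded) cannot be garbage collected until
        // DriverManager.deregisterDriver() is called.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver d = drivers.nextElement();
            System.out.println(d.getClass().getName()
                    + " loaded by " + d.getClass().getClassLoader());
        }
    }
}
```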

3) Unresolved: some threads in a custom source are not properly
shut down and keep references to the UserCodeClassLoader. I have not had
time to look into this issue so far.
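The general pattern behind 3), independent of Flink's APIs: any user thread that keeps running after the job holds a strong reference to its context classloader and therefore leaks permgen. A minimal sketch of an explicit shutdown path (all names are made up):

```java
// Sketch of a source worker thread with an explicit shutdown path.
// A thread left running after the job ends pins the user-code
// classloader (and everything it loaded) in permgen.
public class StoppableWorker implements AutoCloseable {
    private final Thread worker;
    private volatile boolean running = true;

    public StoppableWorker(Runnable step) {
        worker = new Thread(() -> {
            while (running && !Thread.currentThread().isInterrupted()) {
                step.run(); // one unit of work per iteration
            }
        }, "custom-source-worker");
        worker.start();
    }

    public boolean isStopped() {
        return !worker.isAlive();
    }

    @Override
    public void close() throws InterruptedException {
        running = false;     // ask the loop to exit
        worker.interrupt();  // wake the thread if it is blocked
        worker.join(5000);   // wait for it to die; otherwise the loader leaks
    }
}
```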

Cheers,

Konstantin

[1] Side note: we only need flink-table for the "Row" class used in the
JdbcOutputFormat, so it might make sense to move this class somewhere
else. Naturally, we also tried to exclude the "transitive" dependency on
org.apache.calcite, until we noticed that Calcite is packaged into
flink-table, so you cannot even exclude it. What is the reason
for this?




--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082



Re: Flink 1.1.3 OOME Permgen

Fabian Hueske-2
Hi Konstantin,

Regarding 2): I've opened FLINK-5227 to update the documentation [1].

Regarding the Row type: the Row type was introduced for flink-table and was later used by other modules. There is FLINK-5186 to move Row and all the related TypeInfo (plus serializer and comparator) to flink-core [2]. That should solve your issue.

Some of the connector modules which provide TableSources and TableSinks have dependencies on flink-table as well. I'll check that these are optional dependencies, to avoid pulling in Calcite through connectors for jobs that do not need it.

[1] https://issues.apache.org/jira/browse/FLINK-5227
[2] https://issues.apache.org/jira/browse/FLINK-5186
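Declaring flink-table optional in a connector POM would look roughly like this (artifact id and version property are illustrative):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table_2.10</artifactId>
  <version>${flink.version}</version>
  <!-- optional: users of the connector do not pull this in transitively -->
  <optional>true</optional>
</dependency>
```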

Thanks,
Fabian




Re: Flink 1.1.3 OOME Permgen

rmetzger0
Thank you for reporting the issue, Konstantin.
I've filed a JIRA for the Jackson issue: https://issues.apache.org/jira/browse/FLINK-5233.
As I said in the JIRA, I propose upgrading to Jackson 2.7.8, as this version contains the fix for the issue but is not a major Jackson upgrade.

Any chance you could try whether 2.7.8 fixes the issue as well?




Re: Flink 1.1.3 OOME Permgen

rmetzger0
I've submitted WordCount 410 times to a testing cluster and a streaming job 290 times, and I could not reproduce the issue with 1.1.3. Also, the heap dump of one of the TaskManagers looked pretty normal.

Do you have any ideas how to reproduce the issue?



Re: Flink 1.1.3 OOME Permgen

snntr
Hi Robert,

you need to actually use Jackson. The problematic field is a cache,
which is filled with every class that Jackson serializes or
deserializes.
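The leak pattern can be illustrated without Jackson itself: a static cache keyed by Class keeps every cached class's loader strongly reachable. A simplified stdlib-only sketch (names are made up, this is not Jackson's actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the leak: a static map keyed by Class. Each Class
// strongly references its ClassLoader, so caching user-code classes in a
// JVM-wide static structure pins the job's classloader in permgen.
public class StaticTypeCache {
    private static final Map<Class<?>, String> CACHE = new ConcurrentHashMap<>();

    public static String describe(Class<?> type) {
        return CACHE.computeIfAbsent(type, Class::getName);
    }

    public static int size() {
        return CACHE.size();
    }
}
```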

Best,

Konstantin

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082



Re: Flink 1.1.3 OOME Permgen

rmetzger0
I executed this snippet in each Flink job:

// imports: com.fasterxml.jackson.databind.ObjectMapper,
//          com.fasterxml.jackson.databind.node.ObjectNode, java.io.IOException
@Override
public void open(Configuration config) {
  ObjectMapper somethingWithJackson = new ObjectMapper();
  try {
    // readValue(String, Class) parses the JSON and exercises Jackson's caches
    ObjectNode on = somethingWithJackson.readValue("{\"a\": \"b\"}", ObjectNode.class);
  } catch (IOException e) {
    throw new RuntimeException("You failed", e);
  }
}

But I suspect that I need to map my JSON to a POJO?
 

On Mon, Dec 5, 2016 at 12:33 PM, Konstantin Knauf <[hidden email]> wrote:
Hi Robert,

you need to actually use Jackson. The problematic field is a cache,
which is filled by all classes, which were serialized/deserialized by
Jackson.

Best,

Konstantin

On <a href="tel:05.12.2016%2011" value="+49512201611">05.12.2016 11:55, Robert Metzger wrote:
> I've submitted Wordcount 410 times to a testing cluster and a streaming
> job 290 times and I could not reproduce the issue with 1.1.3. Also, the
> heapdump of one of the TaskManagers looked pretty normal.
>
> Do you have any ideas how to reproduce the issue?
>
> On Fri, Dec 2, 2016 at 3:21 PM, Robert Metzger <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Thank you for reporting the issue Konstantin.
>     I've filed a JIRA for the jackson
>     issue: https://issues.apache.org/jira/browse/FLINK-5233
>     <https://issues.apache.org/jira/browse/FLINK-5233>.
>     As I said in the JIRA, I propose to upgrade to Jackson 2.7.8, as
>     this version contains the fix for the issue, but its not a major
>     jackson upgrade.
>
>     Any chance you could try to if 2.7.8 fixes the issue as well?
>
>
>     On Fri, Dec 2, 2016 at 11:12 AM, Fabian Hueske <[hidden email]
>     <mailto:[hidden email]>> wrote:
>
>         Hi Konstantin,
>
>         Regarding 2): I've opened FLINK-5227 to update the documentation
>         [1].
>
>         Regarding the Row type: The Row type was introduced for
>         flink-table and was later used by other modules. There is
>         FLINK-5186 to move Row and all the related TypeInfo (+serializer
>         and comparator) to flink-core [2]. That should solve your issue.
>
>         Some of the connector modules which provide TableSource and
>         TableSinks have dependencies on flink-table as well. I'll check
>         that these are optional dependencies to avoid that we pull in
>         Calcite through connectors for jobs that do not not need it.
>
>         Thanks,
>         Fabian
>
>         [1] https://issues.apache.org/jira/browse/FLINK-5227
>         <https://issues.apache.org/jira/browse/FLINK-5227>
>         [2] https://issues.apache.org/jira/browse/FLINK-5186
>         <https://issues.apache.org/jira/browse/FLINK-5186>
>
>         2016-11-30 17:51 GMT+01:00 Konstantin Knauf
>         <[hidden email]
>         <mailto:[hidden email]>>:
>
>             Hi Stefan,
>
>             unfortunately, I can not share any heap dumps with you. I
>             was able to
>             resolve some of the issues my self today, the root causes
>             were different
>             for different jobs.
>
>             1) Jackson 2.7.2 (which comes with Flink) has a known class
>             loading
>             issue (see
>             https://github.com/FasterXML/jackson-databind/issues/1363
>             <https://github.com/FasterXML/jackson-databind/issues/1363>).
>             Shipping a shaded version of Jackson 2.8.4 with our user
>             code helped. I
>             recommend upgrading Flink's Jackson version soon.
>
>             2) We have a dependency on the flink-table [1] , which ships
>             with
>             Calcite including the Calcite JDBC Driver, which can not
>             been collected
>             cause of the known problem with the java.sql.DriverManager.
>             Putting the
>             flink-table in Flink's lib dir instead of shipping it with
>             the user code
>             helps. You should update the documentation, because this
>             will always
>             happen when using flink-table, I think. So I wonder, why
>             this has not
>             come up before actually.
>
>             3) Unresolved: Some Threads in a custom source which are not
>             proberly
>             shut down and keep references to the UserCodeClassLoader. I
>             did not have
>             time to look into this issue so far.
>
>             Cheers,
>
>             Konstantin
>
>             [1] Side note: We only need flink-table for the "Row" class
>             used in the
>             JdbcOutputFormat, so it might make sense to move this class
>             somewhere
>             else. Naturally, we also tried to exclude the "transitive"
>             dependency on
>             org.apache.calcite until we noticed that calcite is packaged
>             with
>             flink-table, so that you can not even exclude it. What is
>             the reasons
>             for this?
>

--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082


Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.1.3 OOME Permgen

snntr
Yep, I would suppose so. You need to have the reference from the
AppClassLoader to the UserCodeClassLoader.
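That reference chain can be illustrated with a toy static cache. This mimics what Jackson's TypeFactory cache does, but it is not Jackson or Flink code (all names invented):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A static cache that lives in a parent classloader, like the Jackson copy
// on Flink's own classpath. As soon as it stores a Class that was loaded by
// the UserCodeClassLoader, the chain
//   AppClassLoader -> CACHE -> Class -> UserCodeClassLoader
// keeps the whole user classloader, and every class it loaded, alive.
class LeakyTypeCache {

    static final Map<String, Class<?>> CACHE = new ConcurrentHashMap<>();

    static void remember(Class<?> type) {
        CACHE.put(type.getName(), type);
    }

    // True if any cached class pins the given loader.
    static boolean pins(ClassLoader loader) {
        for (Class<?> c : CACHE.values()) {
            if (c.getClassLoader() == loader) {
                return true;
            }
        }
        return false;
    }
}
```

Each job submission or restart creates a fresh UserCodeClassLoader, so every leaked loader adds its classes to PermGen until the OOME.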

On 05.12.2016 12:37, Robert Metzger wrote:

> I executed this snippet in each Flink job:
>
> @Override
> public void open(Configuration config) {
>   ObjectMapper somethingWithJackson = new ObjectMapper();
>   try {
>     ObjectNode on = somethingWithJackson.readValue("{\"a\": \"b\"}",
> ObjectNode.class);
>   } catch (IOException e) {
>     throw new RuntimeException("You failed", e);
>   }
> }
>
> But I suspect that I need to map my JSON to a POJO?
>  
>
> On Mon, Dec 5, 2016 at 12:33 PM, Konstantin Knauf
> <[hidden email]> wrote:
>
>     Hi Robert,
>
>     you need to actually use Jackson. The problematic field is a cache,
>     which is filled with all classes that were serialized or deserialized
>     by Jackson.
>
>     Best,
>
>     Konstantin
>
>     On 05.12.2016 11:55, Robert Metzger wrote:
>     > I've submitted Wordcount 410 times to a testing cluster and a streaming
>     > job 290 times and I could not reproduce the issue with 1.1.3. Also, the
>     > heap dump of one of the TaskManagers looked pretty normal.
>     >
>     > Do you have any ideas how to reproduce the issue?
>     >
>     > On Fri, Dec 2, 2016 at 3:21 PM, Robert Metzger <[hidden email]> wrote:
>     >
>     >     Thank you for reporting the issue Konstantin.
>     >     I've filed a JIRA for the jackson
>     >     issue: https://issues.apache.org/jira/browse/FLINK-5233.
>     >     As I said in the JIRA, I propose to upgrade to Jackson 2.7.8, as
>     >     this version contains the fix for the issue, but it's not a major
>     >     Jackson upgrade.
>     >
>     >     Any chance you could check whether 2.7.8 fixes the issue as well?
>     >
>     >
>     >     On Fri, Dec 2, 2016 at 11:12 AM, Fabian Hueske <[hidden email]> wrote:
>     >
>     >         Hi Konstantin,
>     >
>     >         Regarding 2): I've opened FLINK-5227 to update the documentation
>     >         [1].
>     >
>     >         Regarding the Row type: The Row type was introduced for
>     >         flink-table and was later used by other modules. There is
>     >         FLINK-5186 to move Row and all the related TypeInfo (+serializer
>     >         and comparator) to flink-core [2]. That should solve your issue.
>     >
>     >         Some of the connector modules which provide TableSource and
>     >         TableSinks have dependencies on flink-table as well. I'll check
>     >         that these are optional dependencies to avoid pulling in
>     >         Calcite through connectors for jobs that do not need it.
>     >
>     >         Thanks,
>     >         Fabian
>     >
>     >         [1] https://issues.apache.org/jira/browse/FLINK-5227
>     >         [2] https://issues.apache.org/jira/browse/FLINK-5186
>     >
>     >         2016-11-30 17:51 GMT+01:00 Konstantin Knauf <[hidden email]>:
>     >
>     >             Hi Stefan,
>     >
>     >             unfortunately, I cannot share any heap dumps with
>     >             you. I was able to resolve some of the issues myself
>     >             today; the root causes were different for different
>     >             jobs.
>     >
>     >             1) Jackson 2.7.2 (which comes with Flink) has a known class
>     >             loading
>     >             issue (see
>     >             https://github.com/FasterXML/jackson-databind/issues/1363).
>     >             Shipping a shaded version of Jackson 2.8.4 with our user
>     >             code helped. I
>     >             recommend upgrading Flink's Jackson version soon.
>     >
>     >             2) We have a dependency on flink-table [1], which
>     >             ships with Calcite, including the Calcite JDBC
>     >             driver, which cannot be collected because of the
>     >             known problem with java.sql.DriverManager. Putting
>     >             the
>     >             flink-table in Flink's lib dir instead of shipping it with
>     >             the user code
>     >             helps. You should update the documentation, because
>     >             this will always happen when using flink-table, I
>     >             think. So I wonder why this has not come up before,
>     >             actually.
>     >
>     >             3) Unresolved: Some threads in a custom source are
>     >             not properly shut down and keep references to the
>     >             UserCodeClassLoader. I did not have time to look into
>     >             this issue so far.
>     >
>     >             Cheers,
>     >
>     >             Konstantin
>     >
>     >             [1] Side note: We only need flink-table for the "Row"
>     class
>     >             used in the
>     >             JdbcOutputFormat, so it might make sense to move this
>     class
>     >             somewhere
>     >             else. Naturally, we also tried to exclude the "transitive"
>     >             dependency on
>     >             org.apache.calcite until we noticed that calcite is
>     packaged
>     >             with
>     >             flink-table, so that you cannot even exclude it.
>     >             What are the reasons for this?
>     >
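The shading workaround from point 1 might look roughly like this in a Gradle build, assuming the Shadow plugin is applied; the relocation prefix and version are illustrative, not a verified configuration:

```groovy
// build.gradle fragment (sketch). Relocating Jackson gives the user jar a
// private copy that cannot collide with the 2.7.2 on Flink's classpath.
dependencies {
    compile 'com.fasterxml.jackson.core:jackson-databind:2.8.4'
}

shadowJar {
    relocate 'com.fasterxml.jackson', 'shaded.com.fasterxml.jackson'
}
```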
>
>
--
Konstantin Knauf * [hidden email] * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

