NoClassDefFoundError of a Avro class after cancel then resubmit the same job

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

NoClassDefFoundError of a Avro class after cancel then resubmit the same job

xiatao123
Not sure why, when I submit the job at the first time after a cluster launch,
it is working fine.
After I cancelled the first job, then resubmit the same job again, it will
hit the NoClassDefFoundError.
Very weird, feels like some clean up of a cancelled job messed up future job
of the same classes.
Anyone got the same issue?

java.lang.NoClassDefFoundError: com.xxx.yyy.zzz$Builder
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
        at java.lang.Class.getDeclaredMethods(Class.java:1975)
        at
org.apache.flink.api.java.typeutils.TypeExtractionUtils.getAllDeclaredMethods(TypeExtractionUtils.java:243)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.analyzePojo(TypeExtractor.java:1949)
        at
org.apache.flink.api.java.typeutils.AvroTypeInfo.generateFieldsFromAvroSchema(AvroTypeInfo.java:55)
        at
org.apache.flink.api.java.typeutils.AvroTypeInfo.<init>(AvroTypeInfo.java:48)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1810)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1716)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfoWithTypeHierarchy(TypeExtractor.java:953)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.privateCreateTypeInfo(TypeExtractor.java:814)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfo(TypeExtractor.java:768)
        at
org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfo(TypeExtractor.java:764)
        at
org.apache.flink.api.java.typeutils.runtime.PojoSerializer.createSubclassSerializer(PojoSerializer.java:1129)
        at
org.apache.flink.api.java.typeutils.runtime.PojoSerializer.getSubclassSerializer(PojoSerializer.java:1122)
        at
org.apache.flink.api.java.typeutils.runtime.PojoSerializer.copy(PojoSerializer.java:253)
        at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:526)
        at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
        at
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
        at
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
        at
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
        at
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollectWithTimestamp(StreamSourceContexts.java:309)
        at
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp(StreamSourceContexts.java:408)
        at
org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.emitRecordAndUpdateState(KinesisDataFetcher.java:486)
        at
org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.deserializeRecordForCollectionAndUpdateState(ShardConsumer.java:263)
        at
org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:209)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.xxx.yyy.zzz$Builder
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 31 more




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: NoClassDefFoundError of a Avro class after cancel then resubmit the same job

Edward
Yes, we've seen this issue as well, though it usually takes many more
resubmits before the error pops up. Interestingly, of the 7 jobs we run (all
of which use different Avro schemas), we only see this issue on 1 of them.
Once the NoClassDefFoundError crops up though, it is necessary to recreate
the task managers.

There's a whole page on the Flink documentation on debugging classloading,
and Avro is mentioned several times on that page:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/debugging_classloading.html

It seems that (in 1.3 at least) each submitted job has its own classloader,
and its own instance of the Avro class definitions. However, the Avro class
cache will keep references to the Avro classes from classloaders for the
previous cancelled jobs. That said, we haven't been able to find a solution
to this error yet. Flink 1.4 would be worth a try because the of the changes
to the default classloading behaviour (child-first is the new default, not
parent-first).





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: NoClassDefFoundError of a Avro class after cancel then resubmit the same job

Stephan Ewen
Hi!

We changed a few things between 1.3 and 1.4 concerning Avro. One of the main things is that Avro is no longer part of the core Flink class library, but needs to be packaged into your application jar file.

The class loading / caching issues of 1.3 with respect to Avro should disappear in Flink 1.4, because Avro classes and caches are scoped to the job classloaders, so the caches do not go across different jobs, or even different operators.


Please check: Make sure you have Avro as a dependency in your jar file (in scope "compile").

Hope that solves the issue.

Stephan


On Mon, Jan 22, 2018 at 2:31 PM, Edward <[hidden email]> wrote:
Yes, we've seen this issue as well, though it usually takes many more
resubmits before the error pops up. Interestingly, of the 7 jobs we run (all
of which use different Avro schemas), we only see this issue on 1 of them.
Once the NoClassDefFoundError crops up though, it is necessary to recreate
the task managers.

There's a whole page on the Flink documentation on debugging classloading,
and Avro is mentioned several times on that page:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/debugging_classloading.html

It seems that (in 1.3 at least) each submitted job has its own classloader,
and its own instance of the Avro class definitions. However, the Avro class
cache will keep references to the Avro classes from classloaders for the
previous cancelled jobs. That said, we haven't been able to find a solution
to this error yet. Flink 1.4 would be worth a try because the of the changes
to the default classloading behaviour (child-first is the new default, not
parent-first).

Reply | Threaded
Open this post in threaded view
|

Re: NoClassDefFoundError of a Avro class after cancel then resubmit the same job

Elias Levy
I am currently suffering through similar issues.  

Had a job running happily, but when it the cluster tried to restarted it would not find the JSON serializer in it. The job kept trying to restart in a loop.

Just today I was running a job I built locally.  The job ran fine.  I added two commits and rebuilt the jar.  Now the job dies when it tries to start claiming it can't find the time assigner class.  I've unzipped the job jar, both locally and in the TM blob directory and have confirmed the class is in it.

This is the backtrace:

java.lang.ClassNotFoundException: com.foo.flink.common.util.TimeAssigner
	at java.net.URLClassLoader.findClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Unknown Source)
	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:73)
	at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source)
	at java.io.ObjectInputStream.readClassDesc(Unknown Source)
	at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
	at java.io.ObjectInputStream.readObject0(Unknown Source)
	at java.io.ObjectInputStream.readObject(Unknown Source)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:393)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:380)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:368)
	at org.apache.flink.util.SerializedValue.deserializeValue(SerializedValue.java:58)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.createPartitionStateHolders(AbstractFetcher.java:542)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.<init>(AbstractFetcher.java:167)
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.<init>(Kafka09Fetcher.java:89)
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.<init>(Kafka010Fetcher.java:62)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010.createFetcher(FlinkKafkaConsumer010.java:203)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:564)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:86)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:94)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
	at java.lang.Thread.run(Unknown Source)

On Tue, Jan 23, 2018 at 7:51 AM, Stephan Ewen <[hidden email]> wrote:
Hi!

We changed a few things between 1.3 and 1.4 concerning Avro. One of the main things is that Avro is no longer part of the core Flink class library, but needs to be packaged into your application jar file.

The class loading / caching issues of 1.3 with respect to Avro should disappear in Flink 1.4, because Avro classes and caches are scoped to the job classloaders, so the caches do not go across different jobs, or even different operators.


Please check: Make sure you have Avro as a dependency in your jar file (in scope "compile").

Hope that solves the issue.

Stephan


On Mon, Jan 22, 2018 at 2:31 PM, Edward <[hidden email]> wrote:
Yes, we've seen this issue as well, though it usually takes many more
resubmits before the error pops up. Interestingly, of the 7 jobs we run (all
of which use different Avro schemas), we only see this issue on 1 of them.
Once the NoClassDefFoundError crops up though, it is necessary to recreate
the task managers.

There's a whole page on the Flink documentation on debugging classloading,
and Avro is mentioned several times on that page:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/debugging_classloading.html

It seems that (in 1.3 at least) each submitted job has its own classloader,
and its own instance of the Avro class definitions. However, the Avro class
cache will keep references to the Avro classes from classloaders for the
previous cancelled jobs. That said, we haven't been able to find a solution
to this error yet. Flink 1.4 would be worth a try because the of the changes
to the default classloading behaviour (child-first is the new default, not
parent-first).


Reply | Threaded
Open this post in threaded view
|

Re: NoClassDefFoundError of a Avro class after cancel then resubmit the same job

Elias Levy
Something seems to be off with the user code class loader.  The only way I can get my job to start is if I drop the job into the lib folder in the JM and configure the JM's classloader.resolve-order to parent-first.

Suggestions?

On Thu, Feb 22, 2018 at 12:52 PM, Elias Levy <[hidden email]> wrote:
I am currently suffering through similar issues.  

Had a job running happily, but when it the cluster tried to restarted it would not find the JSON serializer in it. The job kept trying to restart in a loop.

Just today I was running a job I built locally.  The job ran fine.  I added two commits and rebuilt the jar.  Now the job dies when it tries to start claiming it can't find the time assigner class.  I've unzipped the job jar, both locally and in the TM blob directory and have confirmed the class is in it.

This is the backtrace:

java.lang.ClassNotFoundException: com.foo.flink.common.util.TimeAssigner
	at java.net.URLClassLoader.findClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Unknown Source)
	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:73)
	at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source)
	at java.io.ObjectInputStream.readClassDesc(Unknown Source)
	at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
	at java.io.ObjectInputStream.readObject0(Unknown Source)
	at java.io.ObjectInputStream.readObject(Unknown Source)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:393)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:380)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:368)
	at org.apache.flink.util.SerializedValue.deserializeValue(SerializedValue.java:58)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.createPartitionStateHolders(AbstractFetcher.java:542)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.<init>(AbstractFetcher.java:167)
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.<init>(Kafka09Fetcher.java:89)
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.<init>(Kafka010Fetcher.java:62)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010.createFetcher(FlinkKafkaConsumer010.java:203)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:564)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:86)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:94)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
	at java.lang.Thread.run(Unknown Source)

On Tue, Jan 23, 2018 at 7:51 AM, Stephan Ewen <[hidden email]> wrote:
Hi!

We changed a few things between 1.3 and 1.4 concerning Avro. One of the main things is that Avro is no longer part of the core Flink class library, but needs to be packaged into your application jar file.

The class loading / caching issues of 1.3 with respect to Avro should disappear in Flink 1.4, because Avro classes and caches are scoped to the job classloaders, so the caches do not go across different jobs, or even different operators.


Please check: Make sure you have Avro as a dependency in your jar file (in scope "compile").

Hope that solves the issue.

Stephan


On Mon, Jan 22, 2018 at 2:31 PM, Edward <[hidden email]> wrote:
Yes, we've seen this issue as well, though it usually takes many more
resubmits before the error pops up. Interestingly, of the 7 jobs we run (all
of which use different Avro schemas), we only see this issue on 1 of them.
Once the NoClassDefFoundError crops up though, it is necessary to recreate
the task managers.

There's a whole page on the Flink documentation on debugging classloading,
and Avro is mentioned several times on that page:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/debugging_classloading.html

It seems that (in 1.3 at least) each submitted job has its own classloader,
and its own instance of the Avro class definitions. However, the Avro class
cache will keep references to the Avro classes from classloaders for the
previous cancelled jobs. That said, we haven't been able to find a solution
to this error yet. Flink 1.4 would be worth a try because the of the changes
to the default classloading behaviour (child-first is the new default, not
parent-first).



Reply | Threaded
Open this post in threaded view
|

Re: NoClassDefFoundError of a Avro class after cancel then resubmit the same job

Aljoscha Krettek
@Elias This is a know issue that will be fixed in 1.4.2 which we will do very quickly just because of this bug: https://issues.apache.org/jira/browse/FLINK-8741.

On 23. Feb 2018, at 05:53, Elias Levy <[hidden email]> wrote:

Something seems to be off with the user code class loader.  The only way I can get my job to start is if I drop the job into the lib folder in the JM and configure the JM's classloader.resolve-order to parent-first.

Suggestions?

On Thu, Feb 22, 2018 at 12:52 PM, Elias Levy <[hidden email]> wrote:
I am currently suffering through similar issues.  

Had a job running happily, but when it the cluster tried to restarted it would not find the JSON serializer in it. The job kept trying to restart in a loop.

Just today I was running a job I built locally.  The job ran fine.  I added two commits and rebuilt the jar.  Now the job dies when it tries to start claiming it can't find the time assigner class.  I've unzipped the job jar, both locally and in the TM blob directory and have confirmed the class is in it.

This is the backtrace:

java.lang.ClassNotFoundException: com.foo.flink.common.util.TimeAssigner
	at java.net.URLClassLoader.findClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Unknown Source)
	at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:73)
	at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source)
	at java.io.ObjectInputStream.readClassDesc(Unknown Source)
	at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
	at java.io.ObjectInputStream.readObject0(Unknown Source)
	at java.io.ObjectInputStream.readObject(Unknown Source)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:393)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:380)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:368)
	at org.apache.flink.util.SerializedValue.deserializeValue(SerializedValue.java:58)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.createPartitionStateHolders(AbstractFetcher.java:542)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.<init>(AbstractFetcher.java:167)
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.<init>(Kafka09Fetcher.java:89)
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.<init>(Kafka010Fetcher.java:62)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010.createFetcher(FlinkKafkaConsumer010.java:203)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:564)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:86)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:94)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
	at java.lang.Thread.run(Unknown Source)

On Tue, Jan 23, 2018 at 7:51 AM, Stephan Ewen <[hidden email]> wrote:
Hi!

We changed a few things between 1.3 and 1.4 concerning Avro. One of the main things is that Avro is no longer part of the core Flink class library, but needs to be packaged into your application jar file.

The class loading / caching issues of 1.3 with respect to Avro should disappear in Flink 1.4, because Avro classes and caches are scoped to the job classloaders, so the caches do not go across different jobs, or even different operators.


Please check: Make sure you have Avro as a dependency in your jar file (in scope "compile").

Hope that solves the issue.

Stephan


On Mon, Jan 22, 2018 at 2:31 PM, Edward <[hidden email]> wrote:
Yes, we've seen this issue as well, though it usually takes many more
resubmits before the error pops up. Interestingly, of the 7 jobs we run (all
of which use different Avro schemas), we only see this issue on 1 of them.
Once the NoClassDefFoundError crops up though, it is necessary to recreate
the task managers.

There's a whole page on the Flink documentation on debugging classloading,
and Avro is mentioned several times on that page:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/debugging_classloading.html

It seems that (in 1.3 at least) each submitted job has its own classloader,
and its own instance of the Avro class definitions. However, the Avro class
cache will keep references to the Avro classes from classloaders for the
previous cancelled jobs. That said, we haven't been able to find a solution
to this error yet. Flink 1.4 would be worth a try because the of the changes
to the default classloading behaviour (child-first is the new default, not
parent-first).