Hi guys,
I'm using AvroParquetWriter to write Parquet files into S3. When I set up the cluster (starting fresh jobmanager/taskmanager instances, etc.), the scheduled job starts executing without problems and writes the files into S3, but if the job is canceled and started again, it throws the exception java.lang.NoClassDefFoundError: org/joda/time/format/DateTimeParserBucket.
Environment configuration:
- Apache Flink 1.10
- Scala 2.12
- the uber jar flink-shaded-hadoop-2-uber-2.8.3-10.0.jar is on the application classloader (/lib)
- the plugins folder contains an s3-fs-hadoop folder with the jar flink-s3-fs-hadoop-1.10.0.jar

I can work around this issue by adding the joda-time dependency to the Flink lib folder and excluding joda-time from the hadoop-aws dependency that is required by the application code. Do you know what the root cause of this is? Or is there something I could do other than adding the joda-time dependency to the Flink lib folder?

Thanks,
regards, Diogo Santos
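P.S. For reference, the exclusion looks roughly like this in an sbt build (a sketch only; the hadoop-aws version shown here is illustrative and should match whatever the project actually uses):

    // build.sbt: exclude joda-time from hadoop-aws so it is not bundled in the job jar;
    // joda-time is then provided via Flink's lib/ folder instead
    libraryDependencies +=
      "org.apache.hadoop" % "hadoop-aws" % "2.8.3" exclude("joda-time", "joda-time")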
Hi Diogo,

thanks for reporting this issue. It looks quite strange, to be honest. flink-s3-fs-hadoop-1.10.0.jar contains the DateTimeParserBucket class, so either this class wasn't loaded when starting the application from scratch, or there could be a problem with the plugin mechanism on restarts. I'm pulling in Arvid, who worked on the plugin mechanism and might be able to tell us more.

In the meantime, could you provide us with the logs? They might tell us a bit more about what happened.

Cheers,
Till

On Wed, Apr 15, 2020 at 5:54 PM Diogo Santos <[hidden email]> wrote:
For future reference, here is the stack trace in an easier to read format:

Caused by: java.lang.NoClassDefFoundError: org/joda/time/format/DateTimeParserBucket
    at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:825)
    at com.amazonaws.util.DateUtils.parseRFC822Date(DateUtils.java:196)
    at com.amazonaws.services.s3.internal.ServiceUtils.parseRfc822Date(ServiceUtils.java:88)
    at com.amazonaws.services.s3.internal.AbstractS3ResponseHandler.populateObjectMetadata(AbstractS3ResponseHandler.java:121)
    at com.amazonaws.services.s3.internal.S3MetadataResponseHandler.handle(S3MetadataResponseHandler.java:32)
    at com.amazonaws.services.s3.internal.S3MetadataResponseHandler.handle(S3MetadataResponseHandler.java:25)
    at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1714)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1434)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1356)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5052)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1335)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1309)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:904)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1553)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:555)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.createOrOverwrite(HadoopOutputFile.java:81)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:246)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:280)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:535)
    at ...

On Thu, Apr 16, 2020 at 9:26 AM Till Rohrmann <[hidden email]> wrote:
Hi Till,
it definitely seems to be a strange issue. The first time the job is run (on a clean instance of the cluster) it goes well, but if it is canceled or started again the issue comes back.

I built an example here: https://github.com/congd123/flink-s3-example

You can build the artifact of the Flink job and start the cluster with the configuration in the docker-compose file.

Thanks for helping

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Hi Diogo,

I have seen similar issues already. The root cause is usually that users are not actually using any Flink-specific functionality, but going to Hadoop's Parquet writer directly. As you can see in your stack trace, there is not a single reference to a Flink class. The usual solution is to use the respective Flink sink instead of bypassing it [1]. If you opt to implement it manually nonetheless, it is probably easier to bundle Hadoop as a non-Flink dependency.

On Thu, Apr 16, 2020 at 5:36 PM Diogo Santos <[hidden email]> wrote:
> Hi Till,

--
Arvid Heise | Senior Java Developer

Follow us @VervericaData
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng
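For future reference, the respective Flink sink in 1.10 would be the StreamingFileSink with a Parquet bulk writer. A minimal Scala sketch follows, assuming a stream of Avro GenericRecords; the S3 path, the helper name addParquetSink, and the schema are placeholders, not taken from the original job:

    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericRecord
    import org.apache.flink.core.fs.Path
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
    import org.apache.flink.streaming.api.scala.DataStream

    // Attach Flink's bulk-format Parquet sink to an existing stream of Avro GenericRecords.
    def addParquetSink(stream: DataStream[GenericRecord], schema: Schema): Unit = {
      val sink: StreamingFileSink[GenericRecord] = StreamingFileSink
        .forBulkFormat(
          new Path("s3://my-bucket/parquet-output"),    // placeholder path, served by the flink-s3-fs-hadoop plugin
          ParquetAvroWriters.forGenericRecord(schema))  // requires the flink-parquet dependency
        .build()

      stream.addSink(sink)
    }

Note that the StreamingFileSink only finalizes part files on checkpoints, so checkpointing has to be enabled for the Parquet output to become visible.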