AvroParquetWriter issues writing to S3

Diogo Santos
Hi guys,

I'm using AvroParquetWriter to write Parquet files to S3. When I set up the cluster from scratch (fresh JobManager/TaskManager instances, etc.), the scheduled job runs without problems and writes the files to S3. But if the job is canceled and started again, it throws the exception java.lang.NoClassDefFoundError: org/joda/time/format/DateTimeParserBucket

Caused by: java.lang.NoClassDefFoundError: org/joda/time/format/DateTimeParserBucket at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:825) at com.amazonaws.util.DateUtils.parseRFC822Date(DateUtils.java:196) at com.amazonaws.services.s3.internal.ServiceUtils.parseRfc822Date(ServiceUtils.java:88) at com.amazonaws.services.s3.internal.AbstractS3ResponseHandler.populateObjectMetadata(AbstractS3ResponseHandler.java:121) at com.amazonaws.services.s3.internal.S3MetadataResponseHandler.handle(S3MetadataResponseHandler.java:32) at com.amazonaws.services.s3.internal.S3MetadataResponseHandler.handle(S3MetadataResponseHandler.java:25) at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1714) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1434) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1356) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5052) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998) at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1335) at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1309) at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:904) at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1553) at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:555) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.parquet.hadoop.util.HadoopOutputFile.createOrOverwrite(HadoopOutputFile.java:81) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:246) at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:280) at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:535) at
....

Environment configuration:
- Apache Flink 1.10
- Scala 2.12
- the uber JAR flink-shaded-hadoop-2-uber-2.8.3-10.0.jar is in the application classloader (/lib)
- the plugins folder contains the folder s3-fs-hadoop with the JAR flink-s3-fs-hadoop-1.10.0.jar

I can fix this issue by adding the joda-time dependency to the Flink lib folder and excluding joda-time from the hadoop-aws dependency that the application code requires.

Do you know what the root cause of this is? Or is there something I could do other than adding the joda-time dependency to the Flink lib folder?

Thanks

--
Regards,
Diogo Santos

Re: AvroParquetWriter issues writing to S3

Till Rohrmann
Hi Diogo,

thanks for reporting this issue. It looks quite strange, to be honest. flink-s3-fs-hadoop-1.10.0.jar contains the DateTimeParserBucket class, so either this class wasn't loaded when starting the application from scratch, or there could be a problem with the plugin mechanism on restarts. I'm pulling in Arvid, who worked on the plugin mechanism and might be able to tell us more. In the meantime, could you provide us with the logs? They might tell us a bit more about what happened.
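
One way to gather more information here is to print which JAR and classloader a class actually resolves from, both on the first run and after the restart. The following is only a minimal sketch using standard JDK calls; the class name ClassOrigin and the method describe are hypothetical helpers, not part of Flink, and where you call it from (for example the sink's open method) is up to you.

```java
import java.security.CodeSource;

final class ClassOrigin {

    /** Describes where a class name resolves from, as seen by the given classloader. */
    public static String describe(String className, ClassLoader loader) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes defined by the JDK's bootstrap loader typically have no code source
            String location = (source == null || source.getLocation() == null)
                    ? "<no code source, e.g. JDK bootstrap class>"
                    : source.getLocation().toString();
            return className + " -> " + location + " (loader: " + cls.getClassLoader() + ")";
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so it is reported here as well
            return className + " -> NOT LOADABLE (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // The context classloader is an assumption; inside a job, the user-code classloader may be more telling
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        System.out.println(describe("org.joda.time.format.DateTimeParserBucket", context));
        System.out.println(describe("com.amazonaws.services.s3.AmazonS3Client", context));
    }
}
```

If the reported location or loader differs between the first run and the restarted run, that would point toward the classloading setup rather than the application code.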

Cheers,
Till

On Wed, Apr 15, 2020 at 5:54 PM Diogo Santos <[hidden email]> wrote:

Re: AvroParquetWriter issues writing to S3

Till Rohrmann
For future reference, here is the stack trace in an easier to read format:

Caused by: java.lang.NoClassDefFoundError: org/joda/time/format/DateTimeParserBucket
 at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:825)
 at com.amazonaws.util.DateUtils.parseRFC822Date(DateUtils.java:196)
 at com.amazonaws.services.s3.internal.ServiceUtils.parseRfc822Date(ServiceUtils.java:88)
 at com.amazonaws.services.s3.internal.AbstractS3ResponseHandler.populateObjectMetadata(AbstractS3ResponseHandler.java:121)
 at com.amazonaws.services.s3.internal.S3MetadataResponseHandler.handle(S3MetadataResponseHandler.java:32)
 at com.amazonaws.services.s3.internal.S3MetadataResponseHandler.handle(S3MetadataResponseHandler.java:25)
 at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1714)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1434)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1356)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
 at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5052)
 at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
 at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1335)
 at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1309)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:904)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1553)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:555)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
 at org.apache.parquet.hadoop.util.HadoopOutputFile.createOrOverwrite(HadoopOutputFile.java:81)
 at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:246)
 at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:280)
 at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:535)
 at ....

On Thu, Apr 16, 2020 at 9:26 AM Till Rohrmann <[hidden email]> wrote:

Re: AvroParquetWriter issues writing to S3

Diogo Santos
In reply to this post by Till Rohrmann
Hi Till,

It definitely seems to be a strange issue. The first time the job is submitted
(on a clean instance of the cluster) it runs fine, but once it is
canceled and started again, the issue appears.

I built an example here https://github.com/congd123/flink-s3-example

You can build the artifact of the Flink job and start the cluster with
the configuration in the docker-compose file.

Thanks for helping







--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: AvroParquetWriter issues writing to S3

Arvid Heise-3
Hi Diogo,

I have seen similar issues before. The root cause is always that the job does not use any Flink-specific code but goes to Hadoop's Parquet writer directly. As you can see in your stack trace, there is not a single reference to any Flink class.
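
The same check can be done programmatically on a caught exception. The sketch below is only an illustration: StackTraceCheck and mentionsPackage are hypothetical names, and the sample frames are copied from the posted trace, which is truncated, so frames above the cut-off are not covered.

```java
final class StackTraceCheck {

    /** Returns true if any frame of the throwable or of its causes belongs to the given package prefix. */
    public static boolean mentionsPackage(Throwable error, String packagePrefix) {
        // Walk the cause chain; the depth cap guards against cyclic causes
        int depth = 0;
        for (Throwable t = error; t != null && depth < 64; t = t.getCause(), depth++) {
            for (StackTraceElement frame : t.getStackTrace()) {
                if (frame.getClassName().startsWith(packagePrefix)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rebuild two frames from the posted trace for demonstration purposes
        Throwable sample = new NoClassDefFoundError("org/joda/time/format/DateTimeParserBucket");
        sample.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("org.apache.hadoop.fs.s3a.S3AFileSystem", "create", "S3AFileSystem.java", 555),
            new StackTraceElement("org.apache.parquet.hadoop.ParquetWriter$Builder", "build", "ParquetWriter.java", 535),
        });
        System.out.println("Flink frames present: " + mentionsPackage(sample, "org.apache.flink."));
        System.out.println("Hadoop frames present: " + mentionsPackage(sample, "org.apache.hadoop."));
    }
}
```

For these two sample frames the first check is false and the second is true, matching what the posted part of the trace shows.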

The usual solution is to use the respective Flink sink instead of bypassing it [1].
If you nonetheless opt to implement it manually, it's probably easier to bundle Hadoop from a non-Flink dependency.


On Thu, Apr 16, 2020 at 5:36 PM Diogo Santos <[hidden email]> wrote:

--

Arvid Heise | Senior Java Developer

