parquet protobuf output and aws athena support

Jin Yi
using ParquetProtoWriters, does anyone have this working with aws athena ingestion via aws glue crawls?

the parquet files generated by our flink job look fine at a binary level, but aws glue crawler runs over these files in s3 don't seem to be able to deserialize the row data properly.  the schema is picked up correctly, but the actual unmarshalling of the rows seems to fail (with no helpful logs).

likewise, reading the files locally with parquet-tools or pqrs shows the same behavior: the metadata reads perfectly fine, but the actual row data does not.
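For readers wondering what a "binary level" check can and cannot show: the sketch below only verifies the Parquet framing (the `PAR1` magic at both ends of the file and the little-endian footer length stored just before the trailing magic). The function name `check_parquet_framing` is made up for illustration and is not part of any Parquet tool.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic at the start and end of an unencrypted parquet file


def check_parquet_framing(data: bytes) -> int:
    """Return the footer (file metadata) length if the framing looks valid.

    Hypothetical helper for illustration; raises ValueError otherwise.
    """
    # Smallest possible layout: leading magic + 4-byte footer length + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a parquet file")
    if data[:4] != MAGIC:
        raise ValueError("missing leading PAR1 magic")
    if data[-4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # The footer length is a little-endian uint32 right before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

A file can pass this check, and even have a fully readable footer schema, while its data pages still fail to decode, which would be consistent with the symptom described here.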

i'd like to verify that this is just a relatively atypical combination of formats (parquet and protos) that doesn't have widespread tooling support, rather than something i'm overlooking on my end.  for example, must i define the table manually in athena using a create table statement (most examples of parquet/protobuf use this approach) instead of relying on the schema inferred by the aws glue crawler?  i didn't go this route because it seemed counter to the spirit of the parquet format, which embeds the schema in the file.
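As a rough illustration of the manual route mentioned above, the sketch below builds an Athena `CREATE EXTERNAL TABLE ... STORED AS PARQUET` statement from a column mapping. Everything specific in it is an assumption: the helper name `athena_parquet_ddl`, the table name, the columns, and the S3 location are placeholders, not details from this thread.

```python
from typing import Mapping


def athena_parquet_ddl(table: str, columns: Mapping[str, str], location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for parquet data in S3.

    Hypothetical helper; column names and types must match the parquet schema.
    """
    cols = ",\n".join(f"  `{name}` {typ}" for name, typ in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Placeholder table, columns, and bucket, purely for illustration.
    print(
        athena_parquet_ddl(
            "events",
            {"user_id": "bigint", "event_name": "string"},
            "s3://example-bucket/events/",
        )
    )
```

Whether a hand-written table like this reads the rows any better than the crawler-inferred one is not established in this thread; it only separates schema inference from row decoding as possible causes.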

thanks!

Re: parquet protobuf output and aws athena support

Arvid Heise-4
Hi Jin,

I have no experience with your combination. Did you check whether you can read the file in a standalone Java program? That may give you some meaningful logs.

On Mon, Mar 15, 2021 at 8:51 PM Jin Yi <[hidden email]> wrote:
using ParquetProtoWriters, does anyone have this working with aws athena ingestion via aws glue crawls?
