Hello,
I have a stream with each message is a JSON string with a quite complex schema (multiple fields, multiple nested layers), and I need to write that into parquet files after some slight modifications/enrichment. I wonder what options are available for me to do that. I'm thinking of JSON -> AVRO (GenericRecord) -> Parquet. Is that an option? I would want to be able to quickly/dynamically (as less code change as possible) change the JSON schema. With Avro, I could see three options: AvroGeneric, AvroSpecific, AvroReflect. Which one should I use? I could also see ParquetRowDataWriter, and thought of using JsonRowDeserializationSchema to have JSON -> Row -> Parquet. Thanks and regards, Averell -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Hi Averell,
If you can describe the JSON schema, I'd suggest looking into the SQL API. (And I think you do need to define your schema upfront; if I am not mistaken, Parquet must know the common schema.) Then you could do something like:

    CREATE TABLE json (
      -- define the schema of your json data
    ) WITH (
      ...
      'format' = 'json',
      'json.fail-on-missing-field' = 'false',
      'json.ignore-parse-errors' = 'true'
    );

    CREATE TABLE parquet (
      -- define the schema of your parquet data
    ) WITH (
      'connector' = 'filesystem',
      'path' = '/tmp/parquet',
      'format' = 'parquet'
    );

You might also want to have a look at the LIKE clause [3] to define the schema of your parquet table if it is mostly similar to the json schema.

    INSERT INTO parquet SELECT /* transform your data */ FROM json;

Best,
Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/json.html#how-to-create-a-table-with-json-format
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/parquet.html#how-to-create-a-table-with-parquet-format
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/create.html#create-table
Hi Dawid,
Thanks for the suggestion. So, basically, I'll need to use the JSON connector to get the JSON strings into Rows, and then go from Rows to Parquet records using the Parquet connector?

I have never tried the Table API before; I have been using the Streaming API only. I will follow your suggestion now.

Thanks for your help.
Regards,
Averell
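[Editor's note] Putting Dawid's suggestion together for the nested-schema case raised in the original question, a sketch might look like the following. Table names, field names, and the Kafka source are all assumptions for illustration; nested JSON layers map to Flink SQL ROW types, and the LIKE clause copies the source schema into the sink table:

```sql
-- Hypothetical source table: nested JSON layers become ROW types.
CREATE TABLE json_source (
  id STRING,
  payload ROW<value DOUBLE, tags ARRAY<STRING>>
) WITH (
  'connector' = 'kafka',   -- assumption: the stream comes from Kafka
  'topic' = 'events',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  'json.fail-on-missing-field' = 'false',
  'json.ignore-parse-errors' = 'true'
);

-- Sink reuses the source schema via LIKE, replacing the connector options.
CREATE TABLE parquet_sink WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/parquet',
  'format' = 'parquet'
) LIKE json_source (EXCLUDING OPTIONS);

-- The slight modification/enrichment happens in the SELECT.
INSERT INTO parquet_sink
SELECT id, payload FROM json_source;
```

With this layout, a schema change only touches the column lists in the DDL, not any compiled code.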