Avro schema


Avro schema

Sumeet Malhotra
Hi,

Is it possible to directly import Avro schema while ingesting data into Flink? Or do we always have to specify the entire schema in either SQL DDL for Table API or using DataStream data types? From a code maintenance standpoint, it would be really helpful to keep one source of truth for the schema somewhere.

Thanks,
Sumeet

Re: Avro schema

Sumeet Malhotra
Just realized, my question was probably not clear enough. :-)

I understand that the Avro (or JSON for that matter) format can be ingested as described here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connect.html#apache-avro-format, but this still requires the entire table specification to be written in the "CREATE TABLE" section. Is it possible to just specify the Avro schema and let Flink map it to an SQL table?
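To make this concrete, this is roughly the duplication I'd like to avoid (a minimal sketch with made-up columns and a hypothetical Kafka topic, using the Table API's executeSql): the column list has to be restated by hand in the DDL even though it already lives in the .avsc file.

// Minimal sketch (Flink 1.12-era Table API); the table name, columns, and
// connector options below are made up for illustration.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class AvroTableExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // The entire schema must be spelled out here, even though it already
        // exists in the Avro schema file -- this is the duplication in question.
        tEnv.executeSql(
                "CREATE TABLE user_events (" +
                "  user_id STRING," +
                "  event_time TIMESTAMP(3)," +
                "  amount DOUBLE" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'user_events'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'avro'" +
                ")");
    }
}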

BTW, the above link is titled "Table API Legacy Connectors", so is this still supported? Same question for YAML specification.

Thanks,
Sumeet


Re: Avro schema

Paul Lam
Hi Sumeet,

I'm not a Table/SQL API expert, but to my knowledge it's not viable to derive SQL table schemas from Avro schemas, because table schemas are the ground truth by design.
Moreover, one Avro type can map to multiple Flink types (for example, an Avro long could become either BIGINT or, via a logical type, TIMESTAMP), so in practice it may not be viable either.

Best,
Paul Lam



Re: Avro schema

Arvid Heise-4
Hi Sumeet,

The beauty of Avro lies in its reader and writer schemas and the compatibility rules between them: if your schema evolves over time (which happens naturally in streaming but is also very common in batch), you can still use your application as is, without modification. For streaming, this also implies that you can process elements with different schema versions in the same run, which is mandatory for any non-toy example.
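As a rough illustration of what reader/writer schema resolution buys you (plain Avro, nothing Flink-specific; the schemas and field names below are made up): a record written with an old schema can still be read with a newer schema, and the added field is filled in from its default.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class SchemaEvolutionSketch {
    public static void main(String[] args) throws Exception {
        // Writer (old) schema: only "user_id".
        Schema writerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
                "{\"name\":\"user_id\",\"type\":\"string\"}]}");
        // Reader (new) schema: adds "amount" with a default, so old records still resolve.
        Schema readerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
                "{\"name\":\"user_id\",\"type\":\"string\"}," +
                "{\"name\":\"amount\",\"type\":\"double\",\"default\":0.0}]}");

        // Serialize a record with the old writer schema.
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("user_id", "u-42");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
        encoder.flush();

        // Deserialize with the new reader schema: "amount" comes back as its default.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved = new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                .read(null, decoder);
        System.out.println(evolved); // prints {"user_id": "u-42", "amount": 0.0}
    }
}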

If you read into this topic, you will realize that it doesn't make sense to read Avro without specifying a reader schema (except for some generic applications, but those should be written against the DataStream API). If you keep in mind that the same dataset could contain records with different schemas, you will notice that your idea quickly reaches its limits (which schema should be used?). What you could do, if you have very many columns and datasets, is write a small script that generates the DDL from the current schema of your actual data. It would certainly also be an interesting idea to be able to pass a static Avro/JSON schema to the DDL.
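Such a generator could be very small. Here is a hypothetical sketch (the class name, type mapping, and connector placeholder are all assumptions on my side; it only handles a few primitive types and nullable unions, and ignores logical types, enums, and nested records):

import org.apache.avro.Schema;

import java.io.File;
import java.io.IOException;
import java.util.stream.Collectors;

public class Avro2Sql {
    public static void main(String[] args) throws IOException {
        // Assumes the top-level schema in the given .avsc file is a record.
        Schema schema = new Schema.Parser().parse(new File(args[0]));
        String columns = schema.getFields().stream()
                .map(f -> "  `" + f.name() + "` " + toFlinkType(f.schema()))
                .collect(Collectors.joining(",\n"));
        System.out.println("CREATE TABLE " + schema.getName() + " (\n" + columns + "\n) WITH (\n"
                + "  'format' = 'avro'\n"
                + "  -- connector options go here\n"
                + ")");
    }

    // Naive Avro -> Flink SQL type mapping; logical types (decimal, timestamp-millis, ...)
    // and nested records are deliberately left out of this sketch.
    private static String toFlinkType(Schema s) {
        if (s.getType() == Schema.Type.UNION) {
            // Treat ["null", X] as a nullable X and map the non-null branch.
            return s.getTypes().stream()
                    .filter(t -> t.getType() != Schema.Type.NULL)
                    .findFirst()
                    .map(Avro2Sql::toFlinkType)
                    .orElse("NULL");
        }
        switch (s.getType()) {
            case STRING:  return "STRING";
            case INT:     return "INT";
            case LONG:    return "BIGINT";
            case FLOAT:   return "FLOAT";
            case DOUBLE:  return "DOUBLE";
            case BOOLEAN: return "BOOLEAN";
            case BYTES:   return "BYTES";
            default:      throw new IllegalArgumentException("Unhandled Avro type: " + s.getType());
        }
    }
}

Running something like this against the same .avsc that backs your topic at least keeps the DDL from drifting away from the Avro schema.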



Re: Avro schema

Sumeet Malhotra
Hi Arvid,

I certainly appreciate the points you make regarding schema evolution. I did end up writing an avro2sql script to auto-generate the DDL.

Thanks,
Sumeet
