How to infer table schema from Avro file

Soheil Pourbafrani
Hi, I load an Avro file into a Flink DataSet:

AvroInputFormat<GenericRecord> test = new AvroInputFormat<GenericRecord>(
    new Path("PathToAvroFile"),
    GenericRecord.class);
DataSet<GenericRecord> DS = env.createInput(test);

DS.print();

Here are the results of printing DS:
{"N_NATIONKEY": 14, "N_NAME": "KENYA", "N_REGIONKEY": 0, "N_COMMENT": " pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t"}
{"N_NATIONKEY": 15, "N_NAME": "MOROCCO", "N_REGIONKEY": 0, "N_COMMENT": "rns. blithely bold courts among the closely regular packages use furiously bold platelets?"}
{"N_NATIONKEY": 16, "N_NAME": "MOZAMBIQUE", "N_REGIONKEY": 0, "N_COMMENT": "s. ironic, unusual asymptotes wake blithely r"}
{"N_NATIONKEY": 17, "N_NAME": "PERU", "N_REGIONKEY": 1, "N_COMMENT": "platelets. blithely pending dependencies use fluffily across the even pinto beans. carefully silent accoun"}
{"N_NATIONKEY": 18, "N_NAME": "CHINA", "N_REGIONKEY": 2, "N_COMMENT": "c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos"}
{"N_NATIONKEY": 19, "N_NAME": "ROMANIA", "N_REGIONKEY": 3, "N_COMMENT": "ular asymptotes are about the furious multipliers. express dependencies nag above the ironically ironic account"}
{"N_NATIONKEY": 20, "N_NAME": "SAUDI ARABIA", "N_REGIONKEY": 4, "N_COMMENT": "ts. silent requests haggle. closely express packages sleep across the blithely"}

Now I want to create a table from the DS DataSet with exactly the same schema as the Avro file, i.e. the columns should be N_NATIONKEY, N_NAME, N_REGIONKEY, and N_COMMENT.

I know that with the line:

tableEnv.registerDataSet("tbTest", DS, "field1, field2, ...");

I can create a table and name the columns explicitly, but I want the columns to be inferred automatically from the data. Is that possible?
I tried
tableEnv.registerDataSet("tbTest", DS);
but it creates a table with the schema:
root
 |-- f0: GenericType<org.apache.avro.generic.GenericRecord>

Re: How to infer table schema from Avro file

Yun Tang
+ Flink Users

From: Yun Tang <[hidden email]>
Sent: Monday, January 28, 2019 19:46
To: Soheil Pourbafrani
Subject: Re: How to infer table schema from Avro file
 
Hi Soheil

You should pass your generated Avro record class as the type of the AvroInputFormat, not Avro's GenericRecord class. For example, if your generated record class is named 'Nation', the input should be created like this:

AvroInputFormat<Nation> test = new AvroInputFormat<>(
    new Path("PathToAvroFile"),
    Nation.class);

This way, Flink recognizes the input type as a 'PojoType' rather than a 'GenericType', which has only one field, and the columns are inferred automatically from the fields of the record class.
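
To illustrate why a concrete record class matters, here is a small Flink-free sketch in plain Java. It is not Flink's actual type extraction, only an analogy: a class with declared fields exposes column names that can be discovered by reflection, while an opaque generic type does not. The hand-written Nation class is a hypothetical stand-in for the Avro-generated class, its field types are assumptions based on the printed records, and SchemaSketch and columnNames are names made up for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SchemaSketch {

    // Hypothetical stand-in for the Avro-generated record class.
    // Field types are assumed; the real ones come from the Avro schema.
    public static class Nation {
        public long N_NATIONKEY;
        public String N_NAME;
        public long N_REGIONKEY;
        public String N_COMMENT;
    }

    // Collects the names of the non-static, non-synthetic fields declared
    // on a class. Reflection does not guarantee any field order, so the
    // names are sorted to make the result deterministic.
    public static List<String> columnNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            if (!Modifier.isStatic(f.getModifiers()) && !f.isSynthetic()) {
                names.add(f.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Prints the column names discovered from the class definition.
        System.out.println(columnNames(Nation.class));
    }
}
```

With GenericRecord there is no such class-level field list to inspect, because the field names live only in the Avro schema carried by each record at runtime. That is consistent with the single f0 column seen in the question.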

Best
Yun Tang

From: Soheil Pourbafrani <[hidden email]>
Sent: Monday, January 28, 2019 5:54
To: user
Subject: How to infer table schema from Avro file
 