(DEPRECATED) Apache Flink User Mailing List archive.

[PROGRESS UPDATE] [DISCUSS] Flink-Hive Integration and Catalogs

phoenixjiangnan

[PROGRESS UPDATE] [DISCUSS] Flink-Hive Integration and Catalogs

Hi Flink users and devs,

We would like to get your feedback on integrating Flink with Hive.

Background: At Flink Forward in Beijing last December, the community announced an effort to integrate Flink with Hive. At the Seattle Flink Meetup on Feb 21, we presented "Integrating Flink with Hive" with a live demo to the local community and got a great response. As of mid-March, we have internally finished building Flink's brand-new catalog infrastructure, metadata integration with Hive, and the most common cases of Flink reading from and writing to Hive, and we will start to submit more design docs/FLIPs and contribute code back to the community. We did this work internally first to make sure our proposed solutions are fully validated and tested, to gain hands-on experience, and to avoid missing anything in the design. You are very welcome to join this effort, from design and code review to development and testing.

The most important thing we believe you, our Flink users and devs, can do to help RIGHT NOW is to share your Hive use cases and give us feedback on this project. As we go deeper into specific areas of integration, your feedback and suggestions will help us refine our backlog and prioritize our work, and you can get the features you want sooner! For example, if most users mainly read Hive data, then we can prioritize tuning read performance over implementing write capability.

A quick review of what we've finished building internally and are ready to contribute back to the community:

Flink/Hive Metadata Integration

Unified, pluggable catalog infra that manages meta-objects, including catalogs, databases, tables, views, functions, partitions, table/partition stats
Three catalog impls: an in-memory catalog, HiveCatalog for embracing the Hive ecosystem, and GenericHiveMetastoreCatalog for persisting Flink's streaming/batch metadata in the Hive metastore
Hierarchical metadata reference as <catalog_name>.<database_name>.<metaobject_name> in SQL and Table API
Unified function catalog based on the new catalog infra, which also supports Hive simple UDFs
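
To make the hierarchical reference concrete, here is a minimal sketch in Python. It is not Flink code: the `InMemoryCatalog` and `CatalogManager` classes and the `qualify` and `lookup` methods are hypothetical names made up for illustration, and the rule that shorter names fall back to a current catalog and current database is an assumption of this sketch, not something stated above.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class InMemoryCatalog:
    """Hypothetical stand-in for a catalog: database name -> {object name -> metadata}."""
    databases: Dict[str, Dict[str, str]] = field(default_factory=dict)


@dataclass
class CatalogManager:
    """Hypothetical registry of named catalogs plus a current catalog and database."""
    catalogs: Dict[str, InMemoryCatalog]
    current_catalog: str
    current_database: str

    def qualify(self, name: str) -> Tuple[str, str, str]:
        """Expand a 1-, 2- or 3-part name to (catalog, database, object).

        Assumption of this sketch: missing leading parts are filled in from
        the current catalog and current database.
        """
        parts = name.split(".")
        if not all(parts) or len(parts) > 3:
            raise ValueError(f"expected 1 to 3 non-empty name parts, got {name!r}")
        if len(parts) == 1:
            return (self.current_catalog, self.current_database, parts[0])
        if len(parts) == 2:
            return (self.current_catalog, parts[0], parts[1])
        return (parts[0], parts[1], parts[2])

    def lookup(self, name: str) -> str:
        """Resolve the name and fetch the object's metadata from its catalog."""
        catalog, database, obj = self.qualify(name)
        return self.catalogs[catalog].databases[database][obj]
```

With one catalog registered as `memory` and another as `hive`, `lookup("hive.sales.orders")` goes straight to the `hive` catalog, while `lookup("clicks")` is resolved against whatever catalog and database are current.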

Flink/Hive Data Integration

Hive data connector that reads partitioned and non-partitioned Hive tables and supports partition pruning, both simple and complex Hive data types, and basic writes
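
For readers unfamiliar with partition pruning, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the connector's implementation: the `prune_partitions` function is a hypothetical name, partitions are modeled as plain dictionaries, and only equality predicates on partition columns are handled.

```python
from typing import Dict, List

# A partition is described by its partition column values, e.g. {"dt": "2019-03-20"}.
Partition = Dict[str, str]


def prune_partitions(partitions: List[Partition], predicate: Dict[str, str]) -> List[Partition]:
    """Keep only the partitions that match every equality condition in `predicate`.

    Partitions that cannot satisfy the filter are dropped before any data is
    read, so their files never have to be scanned.
    """
    return [
        partition
        for partition in partitions
        if all(partition.get(column) == value for column, value in predicate.items())
    ]
```

For a table partitioned by `dt` and `country`, a filter such as `dt = '2019-03-20'` would leave only the partitions for that day to be read; an empty predicate keeps every partition.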

A more powerful SQL Client, fully integrated with the above features, with more Hive-compatible SQL syntax for a better end-to-end SQL experience

Given the above, we would like to learn from you: How do you use Hive currently? How can we solve your pain points? What features do you expect from Flink-Hive integration? These can be details like:

Which Hive version are you using? Do you plan to upgrade Hive?
Are you planning to switch your Hive engine? What timeline are you looking at? Which capabilities does Flink need to have before you would consider using it with Hive?
What's your motivation for trying Flink-Hive? Maintaining only one data processing system across your teams for simplicity and maintainability? Better performance from Flink than from Hive itself?
What are your Hive use cases? How large is your Hive data? Do you mainly read, or both read and write?
How many Hive user-defined functions do you have? Are they mostly UDFs, GenericUDFs, UDTFs, or UDAFs?
Any other questions or suggestions you have, or even just how you feel about the project

Again, your input will be really valuable to us, and we hope that, with all of us working together, the project can benefit our end users. Please feel free to reply either to this thread or just to me. I'm also working on a questionnaire to better gather your feedback; watch the mailing list over the next couple of days.

Thanks,

Bowen

Shaoxuan Wang

Re: [PROGRESS UPDATE] [DISCUSS] Flink-Hive Integration and Catalogs

Hi Bowen,

Thanks for driving this. I am CCing this email/survey to user-zh@flink.apache.org as well.

I have heard there is a lot of interest in Flink-Hive from the field. One of the biggest requests Hive users have raised is "support for out-of-date Hive versions". A large number of users are still working on clusters with CDH/HDP installed with old Hive versions, say 1.2.1/2.1.1. We need to ensure these Hive versions are supported when planning the work on Flink-Hive integration.

@all: "We want to get your feedback on Flink-Hive integration."

Regards,

Shaoxuan

On Wed, Mar 20, 2019 at 7:16 AM Bowen Li <[hidden email]> wrote:

phoenixjiangnan

Re: [PROGRESS UPDATE] [DISCUSS] Flink-Hive Integration and Catalogs

Thanks, Shaoxuan! I sent a Chinese version to user-zh at the same time yesterday.

From the feedback we have received so far, supporting multiple older Hive versions is definitely one of our next focuses.

More feedback from our community is welcome!

On Tue, Mar 19, 2019 at 8:44 PM Shaoxuan Wang <[hidden email]> wrote:
