Hi all,
I'd like to propose providing a couple of separate Flink distributions that bundle Hive dependencies for specific Hive versions (2.3.4 and 1.2.1). These distributions would be offered to users on the Flink download page [1].
A few reasons to do this:
1) Flink-Hive integration is important to many Flink and Hive users in two dimensions:
a) for Flink metadata: HiveCatalog is the only persistent catalog for managing Flink tables. With Flink 1.10 supporting more DDL, the persistent catalog will play an even more critical role in users' workflows.
b) for Flink data: the Hive data connector (source/sink) helps both Flink and Hive users unlock new use cases in streaming, near-realtime/realtime data warehousing, backfill, etc.
2) currently users have to go through a *really* tedious process to get started, because it requires lots of extra jars (see [2]) that are absent from Flink's lean distribution. We've heard from many users on the public mailing list, in private email, and in DingTalk groups who were frustrated after spending a lot of time figuring out the jars themselves. They would rather have a more "right out of the box" quickstart experience, and play with the catalog and source/sink without hassle.
3) it's easier for users to swap in the Hive dependencies for their own Hive version: they just replace the bundled jars with the right versions, with no need to hunt through the docs.
* Hive 2.3.4 and 1.2.1 are two versions that cover a large share of the user base, which is why we use them as the example dependencies in [1], even though we now support almost all Hive versions [3].
I'd like to hear what the community thinks about this, and how to achieve it if we agree that's the way to go.
Cheers,
Bowen