(DEPRECATED) Apache Flink User Mailing List archive.

[DISCUSS] Integrate Flink SQL well with Hive ecosystem

Zhang, Xuefu

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

I have also shared a design doc on Hive metastore integration that is attached here and also to FLINK-10556[1]. Please kindly review and share your feedback.

Thanks,

Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556

------------------------------------------------------------------
Sender:Xuefu <[hidden email]>
Sent at:2018 Oct 25 (Thu) 01:08
Recipient:Xuefu <[hidden email]>; Shuyi Chen <[hidden email]>
Cc:yanghua1127 <[hidden email]>; Fabian Hueske <[hidden email]>; dev <[hidden email]>; user <[hidden email]>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556

------------------------------------------------------------------
Sender:Xuefu <[hidden email]>
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen <[hidden email]>
Cc:yanghua1127 <[hidden email]>; Fabian Hueske <[hidden email]>; dev <[hidden email]>; user <[hidden email]>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agree with a phased approach and would like to move forward fast. :) We did some work internally on DDL using the Babel parser in Calcite. While Babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too many extensions are made. It's even challenging to find where an extension is needed! It would certainly be better if Calcite could magically support HiveQL by just turning on a flag, such as the one for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and see what their community thinks.

Would you mind sharing more info about the DDL proposal that you mentioned? We can certainly collaborate on this.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Shuyi Chen <[hidden email]>
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu <[hidden email]>
Cc:yanghua1127 <[hidden email]>; Fabian Hueske <[hidden email]>; dev <[hidden email]>; user <[hidden email]>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into 2 stages: making Flink support Hive features, and making Hive work with Flink. I agree with Timo on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress, and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called Babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi

On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <[hidden email]> wrote:
Hi Fabian/Vino,

Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and vice versa.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism that allows users to import them into Flink without any code change.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL2003) with extensions to support Hive's syntax and language features around DDL, DML, and SELECT queries.
7. SQL CLI - This is currently under development in Flink, but more effort is needed.
8. Server - Provide a server that is compatible with Hive's HiveServer2 in its Thrift APIs, so that HiveServer2 users can reuse their existing clients (such as beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.

As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:vino yang <[hidden email]>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <[hidden email]>
Cc:dev <[hidden email]>; Xuefu <[hidden email]>; user <[hidden email]>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal and, like Fabian, think it would be better if you could give more details of the plan.

Thanks, vino.

On Wed, Oct 10, 2018 at 5:27 PM Fabian Hueske <[hidden email]> wrote:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 7:22 PM Zhang, Xuefu <[hidden email]> wrote:
Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have started an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.

We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we also plan to start strategy #2 as a follow-up effort.

I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.

Regards,

Xuefu

References:

[1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked or is working on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where the projects had just got started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark to all of Uber's SQL-on-Hadoop workloads and significantly improved Uber's cluster efficiency.

--
"So you have to trust that the dots will somehow connect in your future."

Attachment: Flink-Hive Metastore Connectivity Design.pdf (202K)

Shuyi Chen

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Thanks a lot for driving this big effort. I would suggest converting your proposal and design doc into a Google doc and sharing it on the dev mailing list for the community to review and comment on, with a title like "[DISCUSS] ... Hive integration design ...". Once approved, we can document it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the implementation. What do you think?

Shuyi

On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <[hidden email]> wrote:

Hi all,

I have also shared a design doc on Hive metastore integration that is attached here and also to FLINK-10556[1]. Please kindly review and share your feedback.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556

Zhang, Xuefu

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Good idea. Actually, the PDF was converted from a Google doc. Here is its link:

https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing

Once we reach an agreement, I can convert it to a FLIP.

Thanks,

Xuefu

------------------------------------------------------------------
Sender:Shuyi Chen <[hidden email]>
Sent at:2018 Nov 1 (Thu) 02:47
Recipient:Xuefu <[hidden email]>
Cc:vino yang <[hidden email]>; Fabian Hueske <[hidden email]>; dev <[hidden email]>; user <[hidden email]>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Thanks a lot for driving this big effort. I would suggest converting your proposal and design doc into a Google doc and sharing it on the dev mailing list for the community to review and comment on, with a title like "[DISCUSS] ... Hive integration design ...". Once approved, we can document it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the implementation. What do you think?

Shuyi
