Add Bucket File System Table Sink

Add Bucket File System Table Sink

zhangjun
Hello everyone:
I am a user and fan of Flink, and I would like to join the Flink community. I contributed my first PR a few days ago. Could anyone help me review my code? If there is something wrong, I would be grateful for your advice.

This PR came out of my development work: I use SQL to read data from Kafka and then write it to HDFS, but I found that there is no suitable TableSink. I checked the documentation and found that the File System Connector is only experimental (https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#file-system-connector), so I wrote a Bucket File System Table Sink that supports writing streaming data to HDFS and local file systems, with support for the JSON, CSV, Parquet, and Avro data formats. I plan to add support for other formats later, such as Protobuf and Thrift.
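For illustration, the kind of DDL such a sink could accept might look roughly like the following; the table name and property keys are a hypothetical sketch for discussion, not the exact syntax implemented in the PR:

```sql
-- Hypothetical DDL sketch; the property keys are illustrative only,
-- not the exact syntax implemented in the PR.
CREATE TABLE bucket_fs_sink (
  user_id BIGINT,
  message STRING,
  event_time TIMESTAMP(3)
) WITH (
  'connector.type' = 'filesystem',
  'connector.path' = 'hdfs://namenode:8020/path/to/output',
  'format.type' = 'json'
);
```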

In addition, I added documentation, the Python API, unit tests, an end-to-end test, sql-client and DDL support, and the build compiles on Travis.

The issue is https://issues.apache.org/jira/browse/FLINK-12584

Thank you very much.


Re: Add Bucket File System Table Sink

Kurt Young
Hi Jun,

Thanks for bringing this up; in general I'm +1 on this feature. As
you might know, there is another ongoing effort around this kind of
table sink, covered by the newly proposed partition support
rework [1]. In that proposal we also want to introduce a new
file system connector that covers not only partition support
but also end-to-end exactly-once semantics in streaming mode.

I would suggest we combine these two efforts into one. The
benefits would be saving some review effort and reducing the
number of core connectors, which eases maintenance in the future.
What do you think?

BTW, BucketingSink is already deprecated; I think we should build
on StreamingFileSink instead.

Best,
Kurt

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-63-Rework-table-partition-support-td32770.html




Re: Add Bucket File System Table Sink

Kurt Young
Thanks. Let me clarify my thinking a bit more. Generally, I would
prefer that we concentrate connector functionality, especially for
the standard and most popular connectors, such as Kafka and the
various file systems with different formats. We should make these
core connectors as powerful as we can, and also prevent bad
situations such as: "if you want this feature, please use connector A,
but if you want that feature, please use connector B".

Best,
Kurt


On Tue, Sep 17, 2019 at 11:11 AM Jun Zhang <[hidden email]> wrote:
Hi Kurt:
Thank you very much.
        I will take a closer look at FLIP-63.

        In the PR I developed, the underlying implementation is StreamingFileSink, not BucketingSink; I just gave the sink the name "Bucket".




Re: Add Bucket File System Table Sink

zhangjun

Hi Kurt:
Thanks.
When I ran into this problem I looked at the existing File System Connector, but its functionality is not powerful or rich enough. It is also built into Flink and referenced by many unit tests, so I did not dare to modify it lightly to enrich its functionality.

So I developed a new connector; later on we can keep only one File System Connector and make sure it is powerful and stable.

I will study FLIP-63 and see whether there is a better way to combine these two efforts. I am very willing to join this development.



Re: Add Bucket File System Table Sink

Kurt Young
Great to hear.

Best,
Kurt



Re: Add Bucket File System Table Sink

Fabian Hueske-2
Hi Jun,

Thank you very much for your contribution.

I think a Bucketing File System Table Sink would be a great addition.

Our code contribution guidelines [1] recommend discussing the design with the community before opening a PR.
First of all, this ensures that the design is aligned with Flink's codebase and future features.
Moreover, it helps to find a committer who can shepherd the PR.

It is also always a good idea to split a contribution into multiple smaller PRs (if possible).
This allows for faster review and progress.

Best, Fabian

[1] https://flink.apache.org/contributing/contribute-code.html




Re: Add Bucket File System Table Sink

zhangjun
Hi Fabian:

Thank you very much for your suggestion. I wrote this feature because I found it inconvenient to write data to HDFS with Flink SQL at work, and afterwards I wanted to contribute it to the community. This is my first PR, so I may not be clear on some of the processes; I am very sorry.

Kurt suggested combining this feature with FLIP-63 because they share some functionality, such as writing data to the file system in various formats, so I would like to treat this feature as a sub-task of FLIP-63: adding a partitionable bucket file system table sink.

I will then add the documentation and send a DISCUSS thread to explain my detailed design and implementation. What do you think?
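
To illustrate what a partitionable bucket file system sink could look like at the DDL level, here is a rough sketch based on the PARTITIONED BY clause proposed in FLIP-63; the table name and property keys are hypothetical, not finalized syntax:

```sql
-- Hypothetical sketch; the PARTITIONED BY clause follows the FLIP-63
-- proposal, and the property keys are illustrative only.
CREATE TABLE partitioned_fs_sink (
  user_id BIGINT,
  message STRING,
  dt STRING,
  hr STRING
) PARTITIONED BY (dt, hr) WITH (
  'connector.type' = 'filesystem',
  'connector.path' = 'hdfs://namenode:8020/warehouse/logs',
  'format.type' = 'parquet'
);
```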


