Streaming kafka data sink to hive


wanglei2@geekplus.com.cn

We have many app logs on our app server and want to parse them into a structured table format and then sink them to Hive.
Batch mode seems like a good fit: the app logs are compressed hourly, which makes partitioning convenient.
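
To make the "parse, then partition by hour" idea concrete, here is a minimal Python sketch. The log line layout (`YYYY-MM-DD HH:MM:SS LEVEL message`), the field names, and the `dt=.../hr=...` partition layout are all assumptions for illustration; the real app log format is not given in this thread.

```python
from datetime import datetime


def parse_log_line(line: str) -> dict:
    """Parse one log line into a structured record.

    Assumes a hypothetical layout: 'YYYY-MM-DD HH:MM:SS LEVEL message'.
    """
    date_part, time_part, level, message = line.rstrip("\n").split(" ", 3)
    ts = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return {"ts": ts, "level": level, "message": message}


def hourly_partition(record: dict) -> str:
    """Build a Hive-style hourly partition path from the record timestamp."""
    ts = record["ts"]
    return f"dt={ts:%Y-%m-%d}/hr={ts:%H}"


record = parse_log_line("2020-03-20 11:55:02 INFO order created id=42")
print(hourly_partition(record))  # dt=2020-03-20/hr=11
```

The same partition key works whether the records arrive from hourly compressed files or from a Kafka topic, as long as it is derived from the event time in the log line rather than the processing time.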

However, we want to use streaming mode: tail the app logs into Kafka, then use Flink to read the Kafka topic and sink the data to Hive.
I have several questions.

1  Is there a Flink Hive connector that I can use to write to Hive in streaming mode?
2  Since HDFS is not friendly to frequent appends and Hive's data is stored on HDFS, is this OK if the throughput is high?

Thanks,
Lei


Re: Streaming kafka data sink to hive

Jingsong Li
Hi wanglei,

> 1  Is there a Flink Hive connector that I can use to write to Hive in streaming mode?

"Streaming kafka data sink to hive" is under discussion [1], and POC work is ongoing [2]. We want to support it in release-1.11.

> 2  Since HDFS is not friendly to frequent appends and Hive's data is stored on HDFS, is this OK if the throughput is high?

Small files are the main concern: it is better for each file to be around 128 MB.
If the throughput is high, I think you can try writing out a file every 5 or 10 minutes.
You can learn more in [3].
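
As a rough way to check an interval against the 128 MB guideline, the arithmetic can be sketched in Python. This is only a back-of-the-envelope estimate under stated assumptions: each parallel writer closes exactly one file per interval, data is spread evenly across writers, and compression is ignored. The function names and the `writers` parameter are hypothetical, not part of any Flink API.

```python
import math

TARGET_FILE_MB = 128  # per-file size suggested in the reply above


def file_size_mb(throughput_mb_per_s: float, roll_interval_s: float, writers: int) -> float:
    """Estimated size of each file when every writer closes one file per interval."""
    return throughput_mb_per_s * roll_interval_s / writers


def writers_for_target(throughput_mb_per_s: float, roll_interval_s: float,
                       target_mb: float = TARGET_FILE_MB) -> int:
    """Largest writer count that still keeps each file at or above target_mb."""
    total_mb = throughput_mb_per_s * roll_interval_s
    return max(1, math.floor(total_mb / target_mb))


# Example: 2 MB/s in total, written by 4 parallel writers.
print(file_size_mb(2, 5 * 60, 4))    # 150.0 (MB per file with a 5-minute interval)
print(file_size_mb(2, 10 * 60, 4))   # 300.0 (MB per file with a 10-minute interval)
print(writers_for_target(2, 5 * 60)) # 4
```

If the estimate falls well below 128 MB, a longer interval or fewer writers moves it closer to the target; at low throughput some later compaction of small files may still be needed.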


Best,
Jingsong Lee

On Fri, Mar 20, 2020 at 11:55 AM wanglei2@geekplus.com.cn wrote:

> We have many app logs on our app server and want to parse them into a structured table format and then sink them to Hive.
> Batch mode seems like a good fit: the app logs are compressed hourly, which makes partitioning convenient.
>
> However, we want to use streaming mode: tail the app logs into Kafka, then use Flink to read the Kafka topic and sink the data to Hive.
> I have several questions.
>
> 1  Is there a Flink Hive connector that I can use to write to Hive in streaming mode?
> 2  Since HDFS is not friendly to frequent appends and Hive's data is stored on HDFS, is this OK if the throughput is high?
>
> Thanks,
> Lei



