Reading from HDFS and publishing to Kafka


Reading from HDFS and publishing to Kafka

Damien Hawes
Hi folks,

I've got the following use case: I need to read data from HDFS and publish it to Kafka, so that it can be reprocessed by another job.

I've searched the web and read the docs. This has turned up no concrete examples or information on how this is achieved, or even whether it's possible at all.

Further context:

1. Flink will be deployed to Kubernetes.
2. Kerberos is active on Hadoop.
3. The data is stored on HDFS as Avro.
4. I cannot install Flink on our Hadoop environment.
5. No stateful computations will be performed.

I've noticed that the flink-avro package provides a class called AvroInputFormat<T>, which has a nullable path field, and I think this is my go-to.

Apologies for the poor formatting ahead, but the code I have in mind looks something like this:



StreamExecutionEnvironment env = ...;
AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class); // path is supplied by readFile() below
DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
// rest, + publishing to Kafka using the FlinkKafkaProducer (sketched below)
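
Fleshing out that last comment, the Kafka side I have in mind looks roughly like this. The topic name and broker address are placeholders, Source is my Avro-generated SpecificRecord class, and the serializer below simply writes plain Avro bytes:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

// Serialise each record to plain Avro binary. A new writer per record keeps the
// lambda serialisable; good enough for a sketch.
SerializationSchema<Source> avroBytes = record -> {
    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<>(Source.class).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    } catch (IOException e) {
        throw new RuntimeException("Failed to serialise record", e);
    }
};

Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "kafka-broker:9092"); // placeholder broker

FlinkKafkaProducer<Source> producer =
        new FlinkKafkaProducer<>("replay-topic", avroBytes, producerProps); // placeholder topic

stream.addSink(producer);

If flink-avro already ships a ready-made Avro SerializationSchema, that would of course be preferable to hand-rolling one.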



My major questions and concerns are:

1. Is it possible to read from HDFS using the StreamExecutionEnvironment object? I'm planning on using the DataStream API because of point (2) below.
2. Because Flink will be deployed on Kubernetes, I have a major concern that if I were to use the DataSet API, once Flink completes and exits, the pods will restart, causing unnecessary duplication of data. Is the pod restart a valid concern?
3. Is there anything special I need to be worried about regarding Kerberos in this instance? The keytab will be materialised on the pods upon startup.
4. Is this even a valid approach? The dataset I need to read and replay is small (12 TB).

Any help, even in part, will be appreciated.

Kind regards, 

Damien




Re: Reading from HDFS and publishing to Kafka

r_khachatryan
Hi,

1. Yes, StreamExecutionEnvironment.readFile can be used for files on HDFS (see the sketch below)
2. I think this is a valid concern. Besides that, there are plans to deprecate the DataSet API [1]
4. Yes, the approach looks good
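
To make (1) and (2) a bit more concrete, here is a minimal sketch (your Source class and the HDFS path are placeholders). readFile has an overload that takes a FileProcessingMode; with PROCESS_ONCE the job reads the input once and then finishes, so a bounded replay works fine on the DataStream API:

import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// readFile() sets the file path on the format, so null is fine here.
AvroInputFormat<Source> format = new AvroInputFormat<>(null, Source.class);

// PROCESS_ONCE: read the existing files once and finish, instead of watching the
// directory forever; the interval argument only matters for PROCESS_CONTINUOUSLY.
DataStreamSource<Source> stream =
        env.readFile(format, "hdfs://path/to/data", FileProcessingMode.PROCESS_ONCE, 10_000L);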

I'm pulling in Aljoscha for your 3rd question (and probably some clarifications on others).

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741


Regards,
Roman



Re: Reading from HDFS and publishing to Kafka

Aljoscha Krettek
Hi,

I actually have no experience running a Flink job on K8s against a
kerberized HDFS so please take what I'll say with a grain of salt.

The only thing you should need to do is to configure the path of your keytab and possibly some other Kerberos settings. For that, check out [1] and [2].
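
Roughly, the relevant entries in flink-conf.yaml would look something like the following. The path, principal, and realm are placeholders, and the KafkaClient context only matters if your Kafka brokers are kerberized as well:

# Keytab as materialised on the pod, plus the matching principal (placeholders).
security.kerberos.login.keytab: /path/to/your.keytab
security.kerberos.login.principal: your-principal@YOUR.REALM
security.kerberos.login.use-ticket-cache: false
# JAAS contexts that should receive the Kerberos credentials
# ("Client" is ZooKeeper, "KafkaClient" is Kafka).
security.kerberos.login.contexts: Client,KafkaClient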

I think in general this looks like the right approach, and Roman's comments are correct as well.

Aljoscha

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/security-kerberos.html#yarnmesos-mode
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#auth-with-external-systems
