Hi guys,
I am trying to write to HDFS from the StreamingFileSink. Where should I provide the IP address of the name node? Can I provide it as part of the flink-conf.yaml file, or should I provide it like this:

Best,
Nick
To add to this question, do I need to set env.hadoop.conf.dir for the JVM so that it points to the Hadoop config, for instance env.hadoop.conf.dir=/etc/hadoop/? Or is it possible to write to HDFS without any external Hadoop config like core-site.xml and hdfs-site.xml?

Best,
Nick
Hi Nick,

Certainly, you could directly use "namenode:port" as the authority of your HDFS path. Then the Hadoop configs (e.g. core-site.xml, hdfs-site.xml) will not be necessary. However, that also means you could not benefit from HDFS high availability[1].

If your HDFS cluster is configured for HA, I strongly suggest you set HADOOP_CONF_DIR for your Flink application. It needs to be set on both the client and the cluster (JM/TM) side. Then your HDFS path can be specified like "hdfs://myhdfs/flink/test", where "myhdfs" is the nameservice configured in hdfs-site.xml.

Best,
Yang
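For illustration, a minimal Java sketch of a StreamingFileSink wired up both ways; "nn-host:8020" is a placeholder namenode address and "myhdfs" follows Yang's nameservice example. This is a sketch under those assumptions, not code from the thread:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class HdfsSinkExample {

    public static void attachSink(DataStream<String> stream) {
        // Without Hadoop configs: point directly at a single namenode.
        // "nn-host:8020" is a placeholder, not an address from the thread.
        StreamingFileSink<String> directSink = StreamingFileSink
                .forRowFormat(new Path("hdfs://nn-host:8020/flink/test"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .build();

        // With HADOOP_CONF_DIR set (Yang's suggestion for HA clusters):
        // use the nameservice from hdfs-site.xml instead of one namenode.
        StreamingFileSink<String> haSink = StreamingFileSink
                .forRowFormat(new Path("hdfs://myhdfs/flink/test"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .build();

        stream.addSink(directSink); // or haSink
    }
}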
Thanks a lot, Yang. What are your thoughts on catching the exception when a name node is down and retrying with the secondary name node?

Best,
Nick
It may work. However, you would need to implement your own retry policy (similar to `ConfiguredFailoverProxyProvider` in Hadoop). Also, if you use the namenode address directly and do not load the HDFS configuration, some HDFS client settings (e.g. dfs.client.*) will not take effect.

Best,
Yang
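To make the manual-failover idea concrete, here is a rough Java sketch against the plain Hadoop FileSystem API. The namenode addresses are hypothetical, and, as Yang points out, this only reimplements a crude version of what ConfiguredFailoverProxyProvider already provides once the HDFS configs are loaded:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManualFailoverWrite {

    // Hypothetical namenode addresses; without hdfs-site.xml they must be hard-coded.
    private static final String[] NAMENODES = {
            "hdfs://nn1.example.com:8020",
            "hdfs://nn2.example.com:8020"
    };

    /** Try each namenode in turn; open an output stream on the first one that responds. */
    public static FSDataOutputStream createWithFailover(String path) throws IOException {
        IOException lastFailure = null;
        for (String namenode : NAMENODES) {
            try {
                FileSystem fs = FileSystem.get(URI.create(namenode), new Configuration());
                return fs.create(new Path(path));
            } catch (IOException e) {
                lastFailure = e; // this namenode may be down or standby; try the next one
            }
        }
        throw lastFailure;
    }
}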
Thanks, Yang. I am going with setting HADOOP_CONF_DIR for the Flink application; it integrates neatly with Flink.

Best,
Nick