Best way to write data to HDFS by Flink

Best way to write data to HDFS by Flink

hawin

Hi All

Can someone tell me the best way to write data to HDFS when Flink receives data from Kafka?

Big thanks for an example.

Best regards

Hawin


Re: Best way to write data to HDFS by Flink

Márton Balassi
Dear Hawin,

You can pass an HDFS path to DataStream's and DataSet's writeAsText and writeAsCsv methods.
I assume that you are running a streaming topology, because your source is Kafka, so it would look like the following:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(new PersistentKafkaSource(..))
      .map(/* your operations */)
      .writeAsText("hdfs://<namenode_name>:<namenode_port>/path/to/your/file");

Check out the relevant section of the streaming docs for more info. [1]


Best,

Marton
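
A minimal end-to-end skeleton around the snippet above, for reference. The makeKafkaSource() factory is a hypothetical placeholder (the Kafka connector API differs between Flink versions, so plug in the source construction for yours), and note that nothing runs until env.execute() is called:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class KafkaToHdfsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> stream = env.addSource(makeKafkaSource());

        stream.map(line -> line.trim()) // stand-in for "your operations"
              .writeAsText("hdfs://<namenode_name>:<namenode_port>/path/to/your/file");

        // The topology above is only a plan; this call actually starts it.
        env.execute("Kafka to HDFS");
    }

    // Hypothetical factory: construct the Kafka source for your Flink version here.
    private static SourceFunction<String> makeKafkaSource() {
        throw new UnsupportedOperationException("plug in your Kafka source");
    }
}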



Re: Best way to write data to HDFS by Flink

hawin
Thanks, Marton.
I will use this code for my testing.



Best regards
Hawin



Re: Best way to write data to HDFS by Flink

hawin
Hi Marton

If we receive a huge amount of data from Kafka and write it to HDFS immediately, we should use the buffer timeout described in the docs you linked.
I am not sure whether you have Flume experience; in Flume, both the buffer size and the partitioning can be configured.

What do I mean by a partition?
For example:
I want to write a 1-minute buffer file to HDFS under /data/flink/year=2015/month=06/day=22/hour=21.
If the partition (/data/flink/year=2015/month=06/day=22/hour=21) already exists, there is no need to create it; otherwise, Flume creates it automatically.
Flume makes sure incoming data lands in the right partition.

I am not sure whether Flink also provides a similar partitioning API or configuration.
Thanks.



Best regards
Hawin
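
(For reference, the buffer timeout Hawin refers to is configured on the stream execution environment. A minimal sketch follows; the 100 ms value is purely illustrative, not a recommendation:)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Flush partially filled network buffers at least every 100 ms, so records
// reach the sink with bounded latency instead of waiting for a buffer to fill.
env.setBufferTimeout(100);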



Re: Best way to write data to HDFS by Flink

Márton Balassi
Dear Hawin,

We do not have out-of-the-box support for that; it is something you would need to implement yourself in a custom SinkFunction.

Best,

Marton
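
(Márton leaves the implementation open. Purely as an illustration, here is a hypothetical sketch of such a custom sink that writes each record into an hourly HDFS partition and creates the directory on demand, mirroring the Flume behavior Hawin describes. Package names and the RichSinkFunction signature may differ between Flink versions, and rolling policies, error handling, and exactly-once guarantees are deliberately left out:)

import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedHdfsSink extends RichSinkFunction<String> {

    private final String baseUri; // e.g. "hdfs://namenode:8020/data/flink"
    private transient FileSystem fs;
    private transient FSDataOutputStream out;
    private transient String currentPartition;

    public PartitionedHdfsSink(String baseUri) {
        this.baseUri = baseUri;
    }

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) throws Exception {
        fs = FileSystem.get(URI.create(baseUri), new org.apache.hadoop.conf.Configuration());
    }

    @Override
    public void invoke(String value) throws Exception {
        String partition = new SimpleDateFormat("'year='yyyy/'month='MM/'day='dd/'hour='HH")
                .format(new Date());
        if (!partition.equals(currentPartition)) {
            // The hour changed: close the old file and roll to a new one.
            if (out != null) {
                out.close();
            }
            Path dir = new Path(baseUri + "/" + partition);
            if (!fs.exists(dir)) {
                fs.mkdirs(dir); // create the partition directory only when missing
            }
            out = fs.create(new Path(dir, "part-" + System.currentTimeMillis()), true);
            currentPartition = partition;
        }
        out.write((value + "\n").getBytes("UTF-8"));
    }

    @Override
    public void close() throws Exception {
        if (out != null) {
            out.close();
        }
    }
}

(It would be attached to the topology with stream.addSink(new PartitionedHdfsSink("hdfs://namenode:8020/data/flink")), the base path again being illustrative.)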




Re: Best way to write data to HDFS by Flink

Stephan Ewen
Hi Hawin!

If you are writing code for such an output into different files/partitions, it would be amazing if you could contribute it to Flink.

It seems like a very common use case, so this functionality would be useful to other users as well!

Greetings,
Stephan





Re: Best way to write data to HDFS by Flink

hawin
Hi Stephan

Yes, that is a great idea. If possible, I will try my best to contribute some code to Flink.
But I have to run some Flink examples first to understand Apache Flink.
I just ran some Kafka-with-Flink examples, and none of them worked for me. I am so sad right now.
I haven't had any trouble running the Kafka examples from kafka.apache.org so far.
Please advise.
Thanks.



Best regards
Hawin






Re: Best way to write data to HDFS by Flink

Márton Balassi
Dear Hawin,

As for your issues with running the Flink Kafka examples: are those resolved with Aljoscha's comment in the other thread? :)

Best,

Marton





RE: Best way to write data to HDFS by Flink

hawin

Dear Marton

Thanks for asking. Yes, it is working now.
But the TPS is not very good. I have hit four issues:

1. My TPS is around 2,000 events per second, but at 2015 Big Data Day LA yesterday I saw companies that achieved 132K events per second on a single node (282K per second on two nodes) using Kafka+Spark. As you know, I used Kafka+Flink, so maybe I need to investigate more on my side.

2. For my performance testing, I used JMeter to produce data to Kafka while Flink wrote the data to HDFS. The total message count on the JMeter side does not match the HDFS side.

3. I found that Flink randomly created folders 1, 2, 3 and 4. Only folders 1 and 4 contain files; folders 2 and 3 are empty.

4. I am going to develop some code that writes data to a /data/flink/year/month/day/hour folder structure. I think that layout will work well with the Flink Table API in the future.

Please let me know if you have any comments or suggestions for me.
Thanks.

Best regards

Hawin



Re: Best way to write data to HDFS by Flink

Stephan Ewen
Hi Hawin!

The performance tuning of Kafka is much trickier than that of Flink. Your bottleneck may be Kafka at this point, not Flink.
To make Kafka fast, make sure you have the right setup for the data directories and that ZooKeeper is set up properly (for good throughput).

To test the Kafka throughput in isolation, push data into Kafka and just consume it with a command-line client that pipes to /dev/null.

Greetings,
Stephan
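
(The same isolation test can also be scripted. Below is an illustrative Java sketch using the newer Kafka consumer client; the broker address, the topic name test-topic, and the one-million-record cutoff are all assumptions:)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThroughputCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "throughput-check");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            long count = 0;
            long start = System.currentTimeMillis();
            while (count < 1000000) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                count += records.count(); // discard payloads; only the rate matters
            }
            double seconds = (System.currentTimeMillis() - start) / 1000.0;
            System.out.printf("%d records in %.1f s = %.0f records/s%n",
                    count, seconds, count / seconds);
        }
    }
}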




On Mon, Jun 29, 2015 at 9:56 AM, Hawin Jiang <[hidden email]> wrote:

Dear  Marton

 

Thanks for your asking.  Yes. it is working now.

But, the TPS is not very good.   I have met four issues as below

 

1.       My TPS around 2000 events per second.   But I saw some companies achieved 132K per second on single node at 2015 Los Angeles big data day yesterday.   For two nodes, the TPS is 282K per sec.  them used kafka+Spark.

As you knew  that I used kafka+Flink. Maybe we have to do more investigations from my side.   

 

2.       Regarding my performance testing, I used JMeter to producer data to Kafka.  The total messages in JMeter side is not matched HDFS side.   In the meantime, I used flink to write data to HDFS. 

 

3.       I found that Flink randomly created 1, 2, 3 and 4 folders. Only 1 and 4 folders have files.  The 2 and 3 folders don’t have any files.

 

4.       I am going to develop some codes to write data to /data/flink/year/month/day/hour folder.  I think that folder structure is good for flink table API in the future.

 

Please let me know if you have some comments or suggests for me.

Thanks.

 

 

 

Best regards

Hawin

 

From: Márton Balassi [mailto:[hidden email]]
Sent: Sunday, June 28, 2015 9:09 PM
To: [hidden email]
Subject: Re: Best way to write data to HDFS by Flink

 

Dear Hawin,

 

As for your issues with running the Flink Kafka examples: are those resolved with Aljoscha's comment in the other thread? :)

 

Best,

 

Marton

 

On Fri, Jun 26, 2015 at 8:40 AM, Hawin Jiang <[hidden email]> wrote:

Hi Stephan

 

Yes, that is a great idea.  if it is possible,  I will try my best to contribute some codes to Flink. 

But I have to run some flink examples first to understand Apache Flink.

I just run some kafka with flink examples.  No examples working for me.   I am so sad right now.

I didn't get any troubles to run kafka examples from kafka.apache.org so far. 

Please suggest me.

Thanks.

 

 

 

Best regards

Hawin

 

 

On Wed, Jun 24, 2015 at 1:02 AM, Stephan Ewen <[hidden email]> wrote:

Hi Hawin!

 

If you are creating code for such an output into different files/partitions, it would be amazing if you could contribute this code to Flink.

 

It seems like a very common use case, so this functionality will be useful to other user as well!

 

Greetings,
Stephan

 

 

On Tue, Jun 23, 2015 at 3:36 PM, Márton Balassi <[hidden email]> wrote:

Dear Hawin,

 

We do not have out of the box support for that, it is something you would need to implement yourself in a custom SinkFunction.

 

Best,

 

Marton

 

On Mon, Jun 22, 2015 at 11:51 PM, Hawin Jiang <[hidden email]> wrote:

Hi  Marton

 

if we received a huge data from kafka and wrote to HDFS immediately.  We should use buffer timeout based on your URL

I am not sure you have flume experience.  Flume can be configured buffer size and partition as well.

 

What is the partition.  

For example:

I want to write 1 minute buffer file to HDFS which is /data/flink/year=2015/month=06/day=22/hour=21. 

if the partition(/data/flink/year=2015/month=06/day=22/hour=21) is there, no need to create it. Otherwise, flume will create it automatically. 

Flume knows the coming data will come to right partition.  

 

I am not sure Flink also provided a similar partition API or configuration for this. 

Thanks.

 

 

 

Best regards

Hawin

 

On Wed, Jun 10, 2015 at 10:31 AM, Hawin Jiang <[hidden email]> wrote:

Thanks Marton

I will use this code to implement my testing.

 

 

 

Best regards

Hawin

 

On Wed, Jun 10, 2015 at 1:30 AM, Márton Balassi <[hidden email]> wrote:

Dear Hawin,

 

You can pass a hdfs path to DataStream's and DataSet's writeAsText and writeAsCsv methods.

I assume that you are running a Streaming topology, because your source is Kafka, so it would look like the following:

 

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

 

env.addSource(PerisitentKafkaSource(..))

      .map(/* do you operations*/)

      .wirteAsText("hdfs://<namenode_name>:<namenode_port>/path/to/your/file");


Check out the relevant section of the streaming docs for more info. [1]

 

 

Best,

 

Marton

 

On Wed, Jun 10, 2015 at 10:22 AM, Hawin Jiang <[hidden email]> wrote:

Hi All

 

Can someone tell me what is the best way to write data to HDFS when Flink received data from Kafka?

Big thanks for your example.

 

 

 

 

Best regards

Hawin

 

 

 

 

 

 

 

 



Re: Best way to write data to HDFS by Flink

rmetzger0
Hey,
can you measure how fast JMeter is able to push data into Kafka? Maybe that is already the bottleneck.

Flink should be able to read from Kafka with 100k+ elements/second on a single node.
