Using Kafka and Flink for batch processing of a batch data source

Using Kafka and Flink for batch processing of a batch data source

Leith Mudge

I am currently working on an architecture for a big data streaming and batch processing platform. I am planning to use Apache Kafka as the distributed messaging system to handle data from streaming data sources and then pass it on to Apache Flink for stream processing. I would also like to use Flink's batch processing capabilities to process batch data.

Does it make sense to pass the batched data through Kafka on a periodic basis as a source for Flink batch processing (is this even possible?), or should I just write the batch data to a data store and then have Flink read and process it from there?

Re: Using Kafka and Flink for batch processing of a batch data source

milind parikh

It likely does not make sense to publish a file ("batch data") into Kafka unless the file is very small.

An improvised pub-sub mechanism around Kafka could be to (a) write the file into a persistent store outside of Kafka and (b) publish a message into Kafka about that write, so as to enable processing of that file.
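
Concretely, the notification in (b) can be as small as a path. A minimal sketch with the plain Kafka producer API; the broker address, topic name, and file path are illustrative assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FileLandedNotifier {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // assumed broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Publish a pointer to the file, not the file's contents.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>(
                        "file-landed",                               // hypothetical topic
                        "hdfs:///landing/2016-07-20/batch-0001.csv" // where the file was written
                ));
            }
        }
    }

A downstream consumer of that topic then reads the file straight from the persistent store.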

If you really need provenance around the processing, you could route the data through NiFi before Flink.

Regards
Milind


Re: Using Kafka and Flink for batch processing of a batch data source

Till Rohrmann
At the moment there is no batch source for Kafka. I'm also not so sure how you would define a batch over a Kafka stream. By reading only up to a certain offset? Or perhaps until n messages have been read?

I think it's best to write the batch data to HDFS or another batch data store and have Flink read it from there.
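
For example, reading such a drop back in with Flink's DataSet API might look like this; a rough sketch in which the HDFS paths and the trivial transformation are placeholders:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class BatchFromHdfs {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Read whatever the upstream batch job dropped into HDFS earlier.
            DataSet<String> lines = env.readTextFile("hdfs:///data/incoming/2016-07-20");

            // Placeholder transformation: drop empty lines and write the result back out.
            lines.filter(line -> !line.isEmpty())
                 .writeAsText("hdfs:///data/processed/2016-07-20");

            env.execute("batch-from-hdfs");
        }
    }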

Cheers,
Till

Re: Using Kafka and Flink for batch processing of a batch data source

Leith Mudge

Thanks Milind & Till,

This is what I thought from my reading of the documentation, but it is nice to have it confirmed by people more knowledgeable.

Supplementary to this question: is Flink the best choice for batch processing at this point in time, or would I be better off looking at a more mature, dedicated batch processing engine such as Spark? I do like the choice that adopting the unified programming model of Apache Beam (the Google Cloud Dataflow SDK) would give me, since it purports to have runners for both Flink and Spark.
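
To make that concrete, this is roughly what a runner-agnostic Beam pipeline looks like. A sketch against Beam's Java API as it later stabilized (newer than the Dataflow SDK naming of the time); the paths and the line-count transform are just illustrative, and the engine is picked at launch with --runner=FlinkRunner or --runner=SparkRunner plus the matching runner dependency:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class PortableBatchPipeline {
        public static void main(String[] args) {
            // --runner=FlinkRunner or --runner=SparkRunner selects the engine.
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
            Pipeline p = Pipeline.create(options);

            p.apply("ReadLines", TextIO.read().from("hdfs:///data/batch/*"))
             .apply("CountDistinctLines", Count.perElement())
             .apply("Format", MapElements.into(TypeDescriptors.strings())
                     .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
             .apply("WriteCounts", TextIO.write().to("hdfs:///data/out/line-counts"));

            p.run().waitUntilFinish();
        }
    }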

 

Regards,

Leith

Re: Using Kafka and Flink for batch processing of a batch data source

milind parikh

At this point in time, IMO, batch processing is not the reason you should be considering Flink.

That said, I predict that stream processing (and event processing) will become the dominant methodology as we gravitate towards the "I can't wait; I want it now" phenomenon. For that methodology, I believe Flink currently represents the cutting edge of what is possible.

Regards
Milind


Re: Using Kafka and Flink for batch processing of a batch data source

Suneel Marthi
I meant to respond to this thread yesterday, but got busy with work and it slipped my mind.

This should be doable using Flink Streaming; others can correct me here.

Assumption: both the batch and streaming processes are reading from a single Kafka topic, and by "batched data" I am assuming it's the same data that is being fed to streaming, just aggregated over a longer time period.

This could be done using a Lambda-like architecture:

1. A Kafka topic that ingests the data to be distributed to the various consumers.
2. A Flink streaming job with a small time window (minutes/seconds) that consumes from Kafka and handles data over this small window.
3. Another Flink streaming job with a very long time window (a few hours?) that also consumes from Kafka and munges over large time periods of data (think of a mini-batch that extends streaming); a sketch of such a job follows below.
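
A rough sketch of one of these jobs, against the Flink 1.x-era API with the 0.9 Kafka consumer; the broker, topic name, key extraction, and reduce are placeholder assumptions:

    import java.util.Properties;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class FastPathJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker1:9092"); // assumed broker
            props.setProperty("group.id", "fast-path"); // give the slow job its own group

            env.addSource(new FlinkKafkaConsumer09<>(
                        "events", new SimpleStringSchema(), props)) // hypothetical topic
               .keyBy(new KeySelector<String, String>() {
                   @Override
                   public String getKey(String record) {
                       return record.split(",")[0]; // placeholder key extraction
                   }
               })
               // The "slow" mini-batch job is identical except for, say, Time.hours(6) here.
               .timeWindow(Time.minutes(1))
               .reduce((a, b) -> a + "\n" + b) // placeholder aggregation
               .print();

            env.execute("fast-path");
        }
    }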

This should work, and you don't need a separate batch process. A similar architecture using Spark Streaming (for both batch and streaming) is demonstrated by Cloudera's Oryx 2.0 project - see http://oryx.io

