dataset flatmap with multiple output types

dataset flatmap with multiple output types

Jon Stewart
Hello,

I am new to Apache Flink, so my apologies if this is a common question. I have a rather complex operation I'd like to apply to each item in a data set. Conceptually, the operation could produce data of several different types, each of which I'd like to flow into a different result set.

In Flink, it looks like all output of a flatMap operation must be of the same type, so I would need to split my processing up from one complex map operation into several in order to express the flow. For example, I might want to split a data set of text lines into words as well as into individual characters:

val lines: DataSet[String] = // lines of text
val words = lines.flatMap { _.split(" ") }
val chars = lines.flatMap { _.toCharArray() }

Since "words" and "chars" in the example above have the same input DataSet and both have a flatMap operation applied to them, will "lines" only be iterated once and have both operations computed simultaneously? The big problem I have is that my objects are considerably heavier-weight than lines of text, so I really only want to iterate them once while performing multiple operations on them.

Thanks in advance,

Jon

Re: dataset flatmap with multiple output types

Alexis Gendronneau
Hi Jon,

I'm pretty sure your input will be processed only once. I may be wrong (please correct me if so), but your pipeline should be compiled roughly as:

source --> flatMap(words) --> result / sink
      |--> flatMap(chars) --> result / sink

As your input is streamed in, each "line" goes through the pipeline and is processed by each flatMap. Each flatMap produces its own result and sends it to its own sink. So each element is processed twice, but in a single pass over the input.
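One way to check this yourself is to print the execution plan before running the job. The JSON plan should show a single data source feeding both flatMap operators. A quick sketch, assuming the Scala DataSet API (the input and sink paths are just placeholders):

import org.apache.flink.api.scala._

object PlanCheck {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val lines: DataSet[String] = env.fromElements("some text", "more text")
    val words = lines.flatMap { _.split(" ") }
    val chars = lines.flatMap { _.toCharArray() }

    // Sinks have to be attached before a plan can be generated.
    words.writeAsText("/tmp/words")
    chars.writeAsText("/tmp/chars")

    // Dumps the job graph as JSON: one source, two flatMap branches.
    println(env.getExecutionPlan())
  }
}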

Hope that helps.


Re: dataset flatmap with multiple output types

rmetzger0
Hi,
Alexis is right. The original data set is only read once and the two flatMaps run in parallel on multiple machines in the cluster.
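If you want to convince yourself, you can count operator invocations with an accumulator. A rough sketch (the operator and accumulator names are made up for this example): the counter should end up equal to the number of input records, i.e. each line passes through the word-splitting flatMap exactly once.

import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Splits lines into words and counts how many lines it has seen.
class CountingWordSplitter extends RichFlatMapFunction[String, String] {
  private val linesSeen = new LongCounter()

  override def open(parameters: Configuration): Unit =
    getRuntimeContext.addAccumulator("lines-seen-by-word-splitter", linesSeen)

  override def flatMap(line: String, out: Collector[String]): Unit = {
    linesSeen.add(1L)
    line.split(" ").foreach(w => out.collect(w))
  }
}

object VerifySinglePass {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val lines = env.fromElements("to be or not to be", "that is the question")

    lines.flatMap(new CountingWordSplitter).writeAsText("/tmp/words")
    lines.flatMap { _.toCharArray() }.writeAsText("/tmp/chars")

    val result = env.execute("verify single pass")
    val seen: java.lang.Long = result.getAccumulatorResult("lines-seen-by-word-splitter")
    println(s"lines seen by word splitter: $seen") // expected: number of input lines
  }
}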

Regards,
Robert
