Hello,

I am new to Apache Flink, so my apologies if this is a common question. I have a rather complex operation I'd like to apply to each item in a data set. Conceptually, the operation could produce several types of data, each of which I'd like to flow into a different result set. In Flink, it looks like all output of a flatMap operation must be of the same type, so I would need to split my processing from one complex map operation into several to express the flow. For example, I might want to split a data set of text lines into words as well as into individual characters:

    val lines: DataSet[String] = ... // lines of text
    val words = lines.flatMap { _.split(" ") }
    val chars = lines.flatMap { _.toCharArray() }

Since "words" and "chars" in the example above share the same input DataSet and both have a flatMap operation applied to them, will "lines" be iterated only once, with both operations computed simultaneously? The big problem I have is that my objects are considerably heavier-weight than lines of text, so I really want to iterate over them only once while performing multiple operations on them.

Thanks in advance,

Jon
Hi Jon,

I'm pretty sure your input will be processed only once. I may be wrong (corrections welcome if so), but that is how your pipeline should be compiled.

2016-05-27 22:33 GMT+02:00 Jon Stewart <[hidden email]>:
Hi,

Alexis is right. The original data set is read only once, and the two flatMaps run in parallel on multiple machines in the cluster.

Regards,
Robert

On Fri, May 27, 2016 at 11:10 PM, Alexis Gendronneau <[hidden email]> wrote:
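If you want the single pass to be explicit in your own code rather than relying on the planner, one pattern is to emit a tagged union from a single flatMap and then peel the tags apart downstream. This is only a sketch under assumptions (the `org.apache.flink.api.scala` batch DataSet API circa Flink 1.x; `Either[String, Char]` standing in for Jon's heavier types), not a tested program:

```scala
import org.apache.flink.api.scala._

object SplitExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val lines: DataSet[String] = env.fromElements("hello world", "flink rocks")

    // One pass over the heavy input: tag every emitted value.
    val tagged: DataSet[Either[String, Char]] = lines.flatMap { line =>
      line.split(" ").map(w => Left(w): Either[String, Char]) ++
        line.toCharArray.map(c => Right(c): Either[String, Char])
    }

    // Cheap downstream splits over the already-computed tagged set.
    val words: DataSet[String] = tagged.filter(_.isLeft).map(_.left.get)
    val chars: DataSet[Char]   = tagged.filter(_.isRight).map(_.right.get)
  }
}
```

Whether this is worth it depends on the cost of the split predicates versus trusting Flink to share the source scan, which Robert's answer confirms it does for the two-flatMap version.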