Dear FLINK community.

I was wondering what would be the recommended (best?) way to achieve some kind of file conversion that runs in parallel on all available Flink nodes, since it is "embarrassingly parallel" (no dependency between files).

Say, I have an HDFS folder that contains multiple structured text files with (x,y) pairs (think of CSV). For each of these files I want to do the following (individually per file):

* Read file from HDFS
* Extract dataset(s) from file (e.g. list of (x,y) pairs)
* Apply some filter (e.g. smoothing)
* Do some pattern recognition on smoothed data
* Write results back to HDFS (different format)

Would the following be a good idea?

DataSource<String> fileList = ... // contains list of file names in HDFS

// For each "filename" in list do...
DataSet<FeatureList> featureList = fileList
    .flatMap(new ReadDataSetFromFile()) // flatMap because there might be multiple DataSets in a file
    .map(new Smoothing())
    .map(new FindPatterns());

featureList.writeAsFormattedText( ... )

I have the feeling that Flink does not distribute the independent tasks to the available nodes but executes everything on only one node.

Cheers
Tim
Hi Tim,

depending on how you create the DataSource<String> fileList [...]. This kind of DataSource will only be executed with a degree of parallelism of 1. The source will send its collection elements in a round-robin fashion to the downstream operators, which are executed with a higher parallelism. So when Flink schedules the downstream operators, it will try to place them close to their inputs. Since all flat map operators have the single data source task as an input, they will be deployed on the same machine if possible.

In contrast, if you had a parallel data source which would consist of multiple source tasks, then these tasks would be independent and spread out across your cluster. In this case, every flat map task would have a single distinct source task as input. When the flat map tasks are deployed, they would be deployed on the machine where their corresponding source is running. Since the source tasks are spread out across the cluster, the flat map tasks would be spread out as well.

What you could do to mitigate your problem is to start the cluster with as many slots as your maximum degree of parallelism. That way, you’ll utilize all cluster resources.

I hope this clarifies a bit why you observe that tasks tend to cluster on a single machine.

Cheers,
Till

On Tue, Feb 23, 2016 at 1:49 PM, Tim Conrad <[hidden email]> wrote:
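To make the distinction above concrete, here is a minimal, self-contained sketch (not part of the original reply) of a parallel data source in the DataSet API. It assumes ExecutionEnvironment.fromParallelCollection with Flink's built-in NumberSequenceIterator; the class name ParallelSourceSketch and the toy map function are illustrative only:

// Minimal sketch, not from the thread: a parallel data source via
// fromParallelCollection(). NumberSequenceIterator is a built-in
// SplittableIterator, so the source is split across all source subtasks
// instead of running as a single task with parallelism 1.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.NumberSequenceIterator;

public class ParallelSourceSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Parallel source: each source subtask receives a slice of 1..1000.
        DataSet<Long> ids = env.fromParallelCollection(
                new NumberSequenceIterator(1L, 1000L), Long.class);

        // Downstream operators are co-located with "their" source subtask,
        // so the work is spread across the cluster.
        DataSet<Long> doubled = ids.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return 2 * value;
            }
        });

        doubled.print();
    }
}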
Hi Till (and others).
Thank you very much for your helpful answer.

On 23.02.2016 14:20, Till Rohrmann wrote:
[...] In contrast, if you had a parallel data source which would consist of multiple source tasks, then these tasks would be independent and spread out across your cluster [...]

Can you please send me a link to an example or to the respective Flink API doc, where I can see what a parallel data source is and how to create it with multiple source tasks? A simple Google search did not provide me with an answer (maybe I used the wrong keywords, though...).

Cheers
Tim
Hi Tim,

unfortunately, this is not documented explicitly as far as I know. For the [...]

I hope this helps.

Cheers,
Till

On Tue, Feb 23, 2016 at 3:44 PM, Tim Conrad <[hidden email]> wrote:
Hello,
> // For each "filename" in list do...
> DataSet<FeatureList> featureList = fileList
>     .flatMap(new ReadDataSetFromFile()) // flatMap because there might be multiple DataSets in a file

What happens if you just insert .rebalance() before the flatMap?

> This kind of DataSource will only be executed with a degree of parallelism of 1. The source will send its collection elements in a round robin fashion to the downstream operators which are executed with a higher parallelism. So when Flink schedules the downstream operators, it will try to place them close to their inputs. Since all flat map operators have the single data source task as an input, they will be deployed on the same machine if possible.

Sorry, I'm a little confused here. Do you mean that the flatMap will have a high parallelism, but all instances on a single machine? Because I tried to reproduce the situation where I have a non-parallel data source followed by a flatMap, and the plan shows that the flatMap actually has parallelism 1, which would be an alternative explanation for the original problem that it gets executed on a single machine.

Then, if I insert .rebalance() after the source, a "Partition" operation appears between the source and the flatMap, and the flatMap has a high parallelism. I think this should also solve the problem, without having to write a parallel data source.

Best,
Gábor
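A runnable sketch (again, not part of the original mail) of this rebalance() suggestion, assuming the DataSet API; the file names and the trivial flatMap body are placeholders for the real per-file processing:

// Sketch only: shows where .rebalance() would go after a non-parallel source.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RebalanceSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Non-parallel source: fromElements()/fromCollection() run with parallelism 1.
        DataSet<String> fileList = env.fromElements("a.csv", "b.csv", "c.csv");

        // rebalance() inserts a round-robin partitioning step, so the flatMap
        // below can run with the default (higher) parallelism instead of 1.
        DataSet<String> processed = fileList
                .rebalance()
                .flatMap(new FlatMapFunction<String, String>() {
                    @Override
                    public void flatMap(String fileName, Collector<String> out) {
                        // stand-in for the real per-file processing
                        out.collect("processed " + fileName);
                    }
                });

        processed.print();
    }
}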
If I’m not mistaken, then this shouldn’t solve the scheduling peculiarity of Flink. Flink will still deploy the tasks of the flat map operation to the machine where the source task is running. Only after this machine has no more slots left will other machines be used as well. I think that you don’t need an explicit [...]

Cheers,
Till

On Wed, Feb 24, 2016 at 4:01 PM, Gábor Gévay <[hidden email]> wrote:
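For reference, the slot-related suggestion earlier in the thread maps to the standard Flink configuration keys; the values below are purely illustrative and would need to match the intended maximum parallelism and cluster size:

# flink-conf.yaml (illustrative values, not from the thread)
# Slots per TaskManager times the number of TaskManagers should cover the
# maximum degree of parallelism, so that all machines receive tasks.
taskmanager.numberOfTaskSlots: 8
parallelism.default: 32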
Dear Till and others.
I solved the issue by using the strategy suggested by Till like this:

List<String> fileListOfSpectra = ...

SplittableList<String> fileListOfSpectraSplitable =
    new SplittableList<String>( fileListOfSpectra );

DataSource<String> fileListOfSpectraDataSource =
    env.fromParallelCollection( fileListOfSpectraSplitable, String.class );

and then - as before -

DataSet<Peaklist> peakLists = fileListOfSpectraDataSource
    .flatMap(new ReadDataFromFile())
    ...

(Find the source for the class "SplittableList" below).

Now FLINK distributes the tasks to all available FLINK nodes. Thanks for the help!

Cheers
Tim

On 24.02.2016 16:30, Till Rohrmann wrote:
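The source of SplittableList referenced above is not included in this excerpt. What follows is a hypothetical sketch of how such a wrapper around a java.util.List could look so that it satisfies the SplittableIterator contract expected by fromParallelCollection; it is not the original class from the thread:

// Hypothetical sketch, not the original SplittableList from the thread.
// It wraps a java.util.List so fromParallelCollection() can hand each
// parallel source subtask its own slice of the list.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.flink.util.SplittableIterator;

public class SplittableList<T> extends SplittableIterator<T> {

    private final ArrayList<T> elements; // serializable copy of the input
    private int position = 0;

    public SplittableList(List<T> elements) {
        this.elements = new ArrayList<>(elements);
    }

    @Override
    public Iterator<T>[] split(int numPartitions) {
        // Distribute the elements round-robin into numPartitions sub-iterators.
        List<List<T>> parts = new ArrayList<>(numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<T>());
        }
        for (int i = 0; i < elements.size(); i++) {
            parts.get(i % numPartitions).add(elements.get(i));
        }
        @SuppressWarnings("unchecked")
        Iterator<T>[] splits = new Iterator[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            splits[i] = parts.get(i).iterator();
        }
        return splits;
    }

    @Override
    public int getMaximumNumberOfSplits() {
        return elements.size();
    }

    @Override
    public boolean hasNext() {
        return position < elements.size();
    }

    @Override
    public T next() {
        return elements.get(position++);
    }
}

Round-robin splitting keeps the per-subtask slices roughly equal in size, which fits the embarrassingly parallel per-file workload described at the top of the thread.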