Hi to all,
I had a very long but useful chat with Fabian and I understood a lot of concepts that was not clear at all to me. We started from the Flink runtime documentation page (https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html) but I discovered that the terminology is very inconsistent and misleading along the page... For example, one of the very first sentences is : "Flink chains operator subtasks together into tasks. Each task is executed by one thread." What I first understood was that every operator can be executed only by a single thread in all the cluster....probably it should be better "one thread per task slot" (at least). Moreover, if I'm not wrong, a Task Slot can execute only 1 subtask (aka parallel instance) of each task and there's no limit to the number of subtasks per slot (and this is not highlighted at all in that document). The only constraint is that they should belong to different tasks (right?). If there's a google doc version of that page I could try to rewrite it down in order to make it easier to understand some parts...however I still have some more questions:
Cheers, Flavio |
Any feedback here..?
On Wed, Apr 5, 2017 at 7:43 PM, Flavio Pompermaier <[hidden email]> wrote:
|
Hi,
sorry for not getting any responses but I think everyone was quite busy with Flink Forward SF. I’m also no expert on the topic but I’ll try and give some answers. Regarding a Google Doc version, I don’t think that there is any. You would have to modify the Markdown version we have in the doc. For the other answers I’ll reuse an example program that consists of Source -> Map -> Sink, with chaining disabled and parallelism 2. We’ll this have three Tasks: Source, Map, and Sink, with each having two subtasks. Let’s denote the subtasks by a number in parenthesis so the first subtask for Source is Source(1), second one is Source(2). I’ll also refer to Source(1) -> Map(1) -> Sink(1) as a slice of the execution graph since these can be executed within one slot. Regarding 1, I think this is true. However, a single slot can execute a complete slice of the execution graph where each subtask (from a different task) would be executed by its own thread. Regarding 2.1, Yes, I think it cannot run multiple subtasks of the same task while it is possible (and in fact done) to execute all the subtasks of a slide in the same slot. Regarding 2.2, This is so to allow executing a pipeline of parallelism 8 using a cluster that has 8 free slots. Basically, each slice fills one slot. Regarding 3, I don’t really have an answer. Regarding 4, Yes, this can get a bit out of hand if you have very long pipelines. Best, Aljoscha
|
Hi Aljoscha,
thanks for the reply, it was not urgent and I was aware of the FF...btw, congratulations for it, I saw many interesting talks! Flink community has grown a lot since it was Stratosphere ;) Just one last question: in many of my use cases it could be helpful to see how many of the created splits were "consumed" by an inputFormat/source. Is it possible to monitor this part somewhere in the dashboards or with a custom metric? On Tue, Apr 18, 2017 at 5:24 PM, Aljoscha Krettek <[hidden email]> wrote:
|
Hi,
there are currently no built-in metrics for InputSplit consumption but I do see that this could be quite helpful. I think you can have a custom RichInputFormat that uses metrics to record stuff, though. I think adding built-in metrics should be possible at this point in the code: https://github.com/apache/flink/blob/8f3d6d239996c83f7cbd102dc8a85ee626a56bf5/flink-runtime/src/main/java/org/apache/flink/runtime/operators/DataSourceTask.java#L144-L144 Best, Aljoscha
|
Free forum by Nabble | Edit this page |