Hi, I would like to understand the execution model.

1. I have a CSV file which is, say, 10 GB.
2. I created a table from this file.
Do you really need the large single table created in step 2? If not, what you typically do is have the CSV source first apply the common transformations. Then, depending on whether the 10 outputs have different processing paths or not, you either do a split() for individual processing based on some criteria, or you simply have the sink put each record into a separate table. You have full control, at each step along the transformation path, over whether it can be parallelized or not, and if there are no sequential constraints in your model, you can fill all cores on all hosts quite easily.

Even if you need the step 2 table, I would still treat that as a split(): a branch ending in a Sink that does the storage there. There is no need to read records from the file over and over again, nor to store them first in the step 2 table and read them out again.

HTH
Niclas

On Mon, Feb 19, 2018 at 3:11 AM, Darshan Singh <[hidden email]> wrote:
--
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java
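[Editor's note: a minimal sketch of the single-scan branching Niclas describes, using Flink's Java DataSet API of that era. The file paths, field types, and filter predicates are illustrative, not from the thread:]

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class SingleScanBranching {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The 10 GB file is scanned exactly once per execution.
        DataSet<Tuple2<String, Integer>> records = env
                .readCsvFile("/data/input.csv")          // illustrative path
                .types(String.class, Integer.class);

        // Common transformations run once, then the pipeline branches.
        DataSet<Tuple2<String, Integer>> cleaned =
                records.filter(t -> t.f0 != null && !t.f0.isEmpty());

        // Each branch has its own filter and its own sink; no intermediate
        // table is materialized and re-read.
        cleaned.filter(t -> t.f1 < 10).writeAsCsv("/out/small");
        cleaned.filter(t -> t.f1 >= 10).writeAsCsv("/out/large");

        env.execute("single-scan branching");
    }
}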
Thanks for the reply. I guess I am not looking for an alternative. I am trying to understand what Flink does in this scenario: if 10 tasks are going in parallel, I am sure they will be reading the CSV, as there is no other way.

On Mon, Feb 19, 2018 at 12:48 AM, Niclas Hedhman <[hidden email]> wrote:
Hi, this works as follows: each sink is translated into its own pipeline, so the source is read once per query. The following figures illustrate the difference:

TableSource1 -> Filter1 -> TableSink1
TableSource2 -> Filter2 -> TableSink2
TableSource3 -> Filter3 -> TableSink3

versus a shared scan:

TableSource -> Filter1 -> TableSink1
           \-> Filter2 -> TableSink2
           \-> Filter3 -> TableSink3

The underlying problem is that the SQL optimizer cannot translate queries with multiple sinks. Instead, each sink is individually translated and the optimizer does not know that common execution paths could be shared.

2018-02-19 2:19 GMT+01:00 Darshan Singh <[hidden email]>:
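[Editor's note: a sketch of both situations in the Java Table API of that era (1.4-style TableEnvironment; table name, field, paths, and predicates are invented for illustration):]

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.sources.CsvTableSource;
import org.apache.flink.types.Row;

public class PerSinkTranslation {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

        CsvTableSource csvSource = CsvTableSource.builder()
                .path("/data/input.csv")                 // illustrative path
                .field("amount", BasicTypeInfo.INT_TYPE_INFO)
                .build();
        tEnv.registerTableSource("csv", csvSource);

        // Each Table converted/sunk on its own is translated individually,
        // so each resulting pipeline contains its own scan of the source:
        Table a = tEnv.scan("csv").filter("amount < 10");
        Table b = tEnv.scan("csv").filter("amount >= 10");
        tEnv.toDataSet(a, Row.class).writeAsText("/out/a");
        tEnv.toDataSet(b, Row.class).writeAsText("/out/b");

        // Workaround (what Darshan does below): convert once to a DataSet;
        // branches derived from it share a single scan in the DataSet plan.
        DataSet<Row> shared = tEnv.toDataSet(tEnv.scan("csv"), Row.class);
        tEnv.registerDataSet("shared", shared);

        env.execute("per-sink translation demo");
    }
}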
Thanks, Fabian, for such a detailed explanation. I am using a DataSet in between, so I guess the CSV is read once.

Now to my real issue: I have 6 task managers, each with 4 cores, and 2 slots per task manager. My CSV file is just 1 GB; I create a table, transform it to a DataSet, and then run 15 different filters plus extra processing, which all run almost in parallel. However, it fails with a "no space left on device" error on one of the task managers. The space on each task manager is 32 GB in /tmp, so I am not sure why it is running out of space. I do use some joins with other tables, but those are a few megabytes.

So I was assuming that somehow all the parallel executions were storing data in /tmp and filling it up. I would like to know what could be filling the space.

Thanks

On 19 Feb 2018 10:10 am, "Fabian Hueske" <[hidden email]> wrote:
Hi, that's a difficult question without knowing the details of your job. This can happen if operators need to spill data to disk: batch operators such as sorts and hash joins write intermediate data to temporary files when it does not fit into the available memory. The data is written to a temp directory that can be configured in the ./conf/flink-conf.yaml file.

2018-02-19 11:22 GMT+01:00 Darshan Singh <[hidden email]>:
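[Editor's note: a sketch of the relevant setting. The key name is assumed from the 1.4-era configuration (later versions renamed it), and the paths are examples; listing several directories spreads spill files over multiple volumes:]

# ./conf/flink-conf.yaml
# taskmanager.tmp.dirs is assumed here as the 1.4-era key for the
# spill directories; multiple paths spread I/O over several disks.
taskmanager.tmp.dirs: /data1/tmp:/data2/tmp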
Thanks. Is there a metric or another way to know how much space each task/job is taking? Does the execution plan have these details?

Thanks

On Mon, Feb 19, 2018 at 10:54 AM, Fabian Hueske <[hidden email]> wrote:
No, there is no size or cardinality estimation happening at the moment.

Best, Fabian

2018-02-19 21:56 GMT+01:00 Darshan Singh <[hidden email]>:
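[Editor's note: as an aside on the execution-plan question above, the DataSet API can dump a JSON plan for inspection; it shows operators and their strategies but, per Fabian's answer, no size or cardinality estimates. A minimal sketch:]

import org.apache.flink.api.java.ExecutionEnvironment;

public class PlanDump {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // ... build the job here: sources, filters, sinks ...

        // Prints the plan as JSON, which can be pasted into Flink's plan
        // visualizer; call it after the sinks have been defined.
        System.out.println(env.getExecutionPlan());
    }
}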
Are there any plans for this in the future? I have looked at the plans, and without these stats I am a bit lost on what to look for, e.g. what the pain points are. I can see some very obvious things, but not much beyond that. My question: is there a guide or document which describes what your plans should look like and what to look into?

On Tue, Feb 20, 2018 at 9:34 AM, Fabian Hueske <[hidden email]> wrote:
Cardinality and size estimation are fundamental requirements for cost-based query optimization. I hope we will work on this at some point, but right now it is not on the roadmap.

In case of very complex plans, it might make sense to write an intermediate result to persistent storage and start another query. I don't think there's a good rule of thumb for this because there are many factors that need to be considered (data size, compute resources, operators, etc.). You'd have to experiment yourself.

Best, Fabian

2018-02-20 23:52 GMT+01:00 Darshan Singh <[hidden email]>:
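[Editor's note: a minimal sketch of the staging approach Fabian suggests, splitting one large program into two executions. Paths, types, and predicates are illustrative:]

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class StagedJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Stage 1: compute the expensive shared part once and persist it.
        DataSet<Tuple2<String, Integer>> shared = env
                .readCsvFile("/data/input.csv")          // illustrative path
                .types(String.class, Integer.class)
                .filter(t -> t.f1 > 0);                  // illustrative common work
        shared.writeAsCsv("/staging/shared");
        env.execute("stage 1");

        // Stage 2: a fresh plan that reads the materialized result instead
        // of recomputing it, then fans out into the per-sink branches.
        DataSet<Tuple2<String, Integer>> staged = env
                .readCsvFile("/staging/shared")
                .types(String.class, Integer.class);
        staged.filter(t -> t.f1 < 10).writeAsCsv("/out/small");
        staged.filter(t -> t.f1 >= 10).writeAsCsv("/out/large");
        env.execute("stage 2");
    }
}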