Hi to all,
I have a job that incrementally adds tuples to a dataset inside a while loop. Is union() the best operator for this, or is there a more performant one? Does union affect reads of the already existing elements, or does it just append the new ones somewhere?

Best,
Flavio
Union, like all operators, is lazy. Calling union() only builds a "union stream" in the plan; the actual union happens when the job is executed. So nothing is added before you call env.execute().

If you call env.execute() and then union again, you will re-execute the entire history of the computation to compute the data set that you union with. Hence, for incremental computations, union() is probably not a good choice unless you persist intermediate data (seamless support for that is WIP).

Stephan

On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <[hidden email]> wrote:
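To make the laziness and the re-execution concrete, here is a minimal plain-Java sketch. It is a toy model, not the Flink API: the class `LazyUnionSketch` and its methods `source`, `union` and `execute` are hypothetical stand-ins, with `execute` playing the role of env.execute(). A counter records how often the source is actually computed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model of a lazily evaluated plan. NOT the Flink API; all names are hypothetical.
final class LazyUnionSketch {
    // Counts how often the (pretend-expensive) source is actually computed.
    static int sourceEvaluations = 0;

    // A "data set" here is only a recipe that produces its elements when asked.
    static Supplier<List<Integer>> source() {
        return () -> {
            sourceEvaluations++;
            return List.of(1, 2, 3);
        };
    }

    // union() only wires two recipes together; nothing is computed yet.
    static Supplier<List<Integer>> union(Supplier<List<Integer>> a,
                                         Supplier<List<Integer>> b) {
        return () -> {
            List<Integer> out = new ArrayList<>(a.get());
            out.addAll(b.get());
            return out;
        };
    }

    // Stand-in for env.execute(): evaluates the whole plan, starting at the sources.
    static List<Integer> execute(Supplier<List<Integer>> plan) {
        return plan.get();
    }

    public static void main(String[] args) {
        Supplier<List<Integer>> x = source();
        Supplier<List<Integer>> first = union(x, () -> List.of(4));
        System.out.println(sourceEvaluations); // 0: building the union computed nothing

        execute(first);                        // first run evaluates the source once
        Supplier<List<Integer>> second = union(first, () -> List.of(5));
        execute(second);                       // second run walks the whole history again
        System.out.println(sourceEvaluations); // 2: the source was recomputed
    }
}
```

In this model the second `execute` has no memory of the first, which is why the source is evaluated twice. Persisting the intermediate result would correspond to replacing `first` with a recipe that returns the already computed list.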
Hi Stephan,
thanks for the answer. Unfortunately I didn't understand whether there's an alternative to union right now. My process is basically like this:

    Dataset x = ...
    while (loopCnt < 3) {
        x = x.join(y).where(0).equalTo(0).with(...);
        accumulated = x.filter(t.f1 == 0);
        x = x.filter(t.f1 != 0);
        loopCnt++;
    }

Best,
Flavio

On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <[hidden email]> wrote:
Hi Flavio,

your example does not contain a union. Union itself basically comes for free. However, if you have a lot of small DataSets that you want to union, the plan can become very complex and might cause overhead due to scheduling many small tasks. For example, it is usually better to have one data source and input format that reads multiple small files than to add one data source per tiny file and union all of them.

TL;DR: if your iteration count is only 3, as your example suggests, you should be fine. If it exceeds, say, 32, it might be worth rethinking your program.

Cheers,
Fabian

2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Sorry, the program does have a union:

    accumulated = accumulated.union(x.filter(t.f1 == 0))

On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
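For readers following the loop logic, here is a plain-Java rendition of the accumulate-in-a-loop pattern discussed in this thread. It is only a sketch on in-memory lists, not Flink code: the `Tuple` layout (key, f1), the contents of `y`, and the join function (subtract y's value from f1, floored at 0) are assumptions made up for illustration, since the thread does not show what the real with(...) function does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the loop in this thread; not Flink code.
// Tuple layout and join logic are hypothetical.
final class UnionLoopSketch {
    static final class Tuple {
        final int key;
        final int f1;
        Tuple(int key, int f1) { this.key = key; this.f1 = f1; }
        @Override public String toString() { return "(" + key + "," + f1 + ")"; }
    }

    // Stand-in for x.join(y).where(0).equalTo(0).with(...):
    // subtracts y's value for the key from f1, never going below 0.
    static List<Tuple> join(List<Tuple> x, Map<Integer, Integer> y) {
        List<Tuple> out = new ArrayList<>();
        for (Tuple t : x) {
            Integer delta = y.get(t.key);
            if (delta != null) { // inner join: keys missing from y are dropped
                out.add(new Tuple(t.key, Math.max(0, t.f1 - delta)));
            }
        }
        return out;
    }

    static List<Tuple> run(List<Tuple> x, Map<Integer, Integer> y, int rounds) {
        List<Tuple> accumulated = new ArrayList<>();
        for (int loopCnt = 0; loopCnt < rounds; loopCnt++) {
            List<Tuple> remaining = new ArrayList<>();
            for (Tuple t : join(x, y)) {
                if (t.f1 == 0) {
                    accumulated.add(t); // accumulated = accumulated.union(x.filter(f1 == 0))
                } else {
                    remaining.add(t);   // x = x.filter(f1 != 0)
                }
            }
            x = remaining;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<Tuple> x = List.of(new Tuple(1, 1), new Tuple(2, 2), new Tuple(3, 5));
        Map<Integer, Integer> y = Map.of(1, 1, 2, 1, 3, 1);
        // Keys 1 and 2 reach f1 == 0 within 3 rounds; key 3 does not.
        System.out.println(run(x, y, 3));
    }
}
```

Each round adds one union to the accumulated result, so a loop count of 3 yields three unions in the equivalent Flink plan.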
If the loop count of 3 is fixed (or not significantly larger), union should be fine.

2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <[hidden email]>:
OK, thanks. Are there any alternatives to that? May I use accumulators for it?

On 7 Sep 2015 17:47, "Fabian Hueske" <[hidden email]> wrote:
Accumulators can be used to collect records, but they are not designed to hold large amounts of data. How many unions do you plan to use in your program? Accumulators might work up to a certain point (~10 MB) and fail beyond that.

2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <[hidden email]>:
On the order of 10 GB.

On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <[hidden email]> wrote:
And how many unions would your program use if you followed the union-in-loop approach?

2015-09-07 18:31 GMT+02:00 Flavio Pompermaier <[hidden email]>:
3 or 4, usually.

On 7 Sep 2015 18:39, "Fabian Hueske" <[hidden email]> wrote:
In that case you should go with union.

2015-09-07 19:06 GMT+02:00 Flavio Pompermaier <[hidden email]>: