Union/append performance question

Flavio Pompermaier
Hi all,
I have a job in which I have to incrementally add tuples to a DataSet (in a while loop).
Is union() the best operator for this task, or is there a more performant one?
Does union affect how the already existing elements are read, or does it just append the new ones somewhere?

Best,
Flavio


Re: Union/append performance question

Stephan Ewen
Union, like all operators, is lazy. Calling union() only builds a "union stream" in the program plan; the data sets are actually unioned when the job is executed. So nothing is added before you call "env.execute()".

If you call "env.execute()" and then union again, the next execution recomputes the entire history of computation that produces the data set you union with. Hence, for incremental computations, union() is probably not a good choice unless you persist intermediate data (seamless support for that is WIP).
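
The laziness described above can be illustrated with a small toy model. This is only a sketch in plain Java and does not use the Flink API: `LazySet`, `expensiveSource` and `sourceRuns` are hypothetical names invented for the illustration. A "data set" here is just a recipe, `union` only builds a bigger recipe, and every `execute()` runs the recipe again from its sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of lazily evaluated data sets; this is NOT the Flink API. */
final class LazyPlanSketch {

    /** Counts how often the (hypothetical) expensive source is evaluated. */
    static int sourceRuns = 0;

    /** A "data set" is only a recipe; nothing is computed until execute(). */
    static final class LazySet {
        private final Supplier<List<Integer>> recipe;

        LazySet(Supplier<List<Integer>> recipe) {
            this.recipe = recipe;
        }

        /** Builds a bigger recipe; no data is touched here. */
        LazySet union(LazySet other) {
            return new LazySet(() -> {
                List<Integer> out = new ArrayList<>(recipe.get());
                out.addAll(other.recipe.get());
                return out;
            });
        }

        /** Runs the whole recipe from its sources, like submitting a job. */
        List<Integer> execute() {
            return recipe.get();
        }
    }

    /** Stand-in for an expensive upstream computation. */
    static LazySet expensiveSource() {
        return new LazySet(() -> {
            sourceRuns++;
            return Arrays.asList(1, 2, 3);
        });
    }

    public static void main(String[] args) {
        LazySet base = expensiveSource();
        LazySet first = base.union(new LazySet(() -> Arrays.asList(4)));
        System.out.println(sourceRuns);        // 0: union alone runs nothing
        System.out.println(first.execute());   // [1, 2, 3, 4]

        LazySet second = first.union(new LazySet(() -> Arrays.asList(5)));
        System.out.println(second.execute());  // [1, 2, 3, 4, 5]
        System.out.println(sourceRuns);        // 2: the source was recomputed
    }
}
```

In this model the second `execute()` evaluates the source a second time because nothing was persisted in between, which is the cost the message warns about for incremental use.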

Stephan


On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <[hidden email]> wrote:

Re: Union/append performance question

Flavio Pompermaier
Hi Stephan,
thanks for the answer. Unfortunately, I didn't understand whether there is an alternative to union right now.
My process is basically like this:

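DataSet x = ...
while (loopCnt < 3) {
   // join step; the join function passed to with() is elided here
   x = x.join(y).where(0).equalTo(0).with(...);
   // tuples with f1 == 0 are collected
   accumulated = x.filter(t -> t.f1 == 0);
   // the remaining tuples go into the next round
   x = x.filter(t -> t.f1 != 0);
   loopCnt++;
}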

Best,
Flavio


On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <[hidden email]> wrote:

Re: Union/append performance question

Fabian Hueske-2
Hi Flavio,

Your example does not contain a union.

Union itself basically comes for free. However, if you have a lot of small DataSets that you want to union, the plan can become very complex and might cause overhead due to scheduling many small tasks. For example, it is usually better to have one data source and input format that reads multiple small files than to add one data source for each tiny file and apply union to all of them to get all the data.

TL;DR: if your iteration count is only 3, as your example suggests, you should be fine. If it exceeds, say, 32, it might be worth rethinking your program.
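
As a rough illustration of why many small unions inflate the plan, here is a back-of-the-envelope sketch in plain Java. It assumes that every input file gets its own source node and that every union() call adds one binary node; these counts are a simplification for illustration, not what the Flink optimizer actually reports, and the method names are hypothetical.

```java
/** Toy plan-size arithmetic; a simplification, NOT Flink's optimizer output. */
final class UnionPlanSizeSketch {

    /** One source per input plus one binary union per additional input. */
    static int nodesWithUnionPerInput(int inputs) {
        if (inputs < 1) {
            throw new IllegalArgumentException("need at least one input");
        }
        return inputs + (inputs - 1);
    }

    /** A single source whose input format reads all the files itself. */
    static int nodesWithSingleSource(int inputs) {
        if (inputs < 1) {
            throw new IllegalArgumentException("need at least one input");
        }
        return 1;
    }

    public static void main(String[] args) {
        for (int inputs : new int[] {3, 32, 1000}) {
            System.out.println(inputs + " inputs: "
                    + nodesWithUnionPerInput(inputs) + " nodes with unions, "
                    + nodesWithSingleSource(inputs) + " node with one source");
        }
    }
}
```

Under these assumptions, 3 inputs give 5 plan nodes and 32 inputs give 63, while the single-source variant stays at 1 regardless of the number of files.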

Cheers, Fabian



2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <[hidden email]>:

Re: Union/append performance question

Flavio Pompermaier
Sorry, the program does have a union: accumulated = accumulated.union(x.filter(t -> t.f1 == 0))
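
For readers who want to see the control flow of the corrected loop, here is a minimal sketch over in-memory lists in plain Java. It does not use Flink: tuples are modelled as `int[] {key, f1}`, and `step` is a hypothetical stand-in for the elided join (it simply decrements f1), so only the split-and-accumulate pattern is meant to match the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** In-memory model of the accumulate-in-a-loop pattern; NOT the Flink API. */
final class AccumulateLoopSketch {

    /** Hypothetical stand-in for the elided join: decrements f1 of each tuple. */
    static List<int[]> step(List<int[]> x) {
        return x.stream()
                .map(t -> new int[] {t[0], t[1] - 1})
                .collect(Collectors.toList());
    }

    /** Runs the loop and returns everything that was accumulated. */
    static List<int[]> run(List<int[]> input, int maxLoops) {
        List<int[]> x = input;
        List<int[]> accumulated = new ArrayList<>();
        for (int loopCnt = 0; loopCnt < maxLoops; loopCnt++) {
            x = step(x);
            // accumulated = accumulated.union(x.filter(t -> t.f1 == 0))
            for (int[] t : x) {
                if (t[1] == 0) {
                    accumulated.add(t);
                }
            }
            // x = x.filter(t -> t.f1 != 0)
            x = x.stream().filter(t -> t[1] != 0).collect(Collectors.toList());
        }
        return accumulated;
    }

    /** Extracts the key field of each tuple, for printing and checking. */
    static List<Integer> keys(List<int[]> tuples) {
        return tuples.stream().map(t -> t[0]).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<int[]> input = Arrays.asList(
                new int[] {1, 1}, new int[] {2, 2},
                new int[] {3, 3}, new int[] {4, 5});
        // Keys 1, 2 and 3 reach f1 == 0 in rounds 1, 2 and 3; key 4 never does.
        System.out.println(keys(run(input, 3)));  // [1, 2, 3]
    }
}
```

With three iterations the loop performs three unions, which is the case the thread treats as unproblematic.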

On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:

Re: Union/append performance question

Fabian Hueske-2
If the loop count of 3 is fixed (or not significantly larger), union should be fine.

2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <[hidden email]>:

Re: Union/append performance question

Flavio Pompermaier

OK, thanks. Are there any alternatives to that? May I use accumulators for that?

On 7 Sep 2015 17:47, "Fabian Hueske" <[hidden email]> wrote:

Re: Union/append performance question

Fabian Hueske-2
Accumulators can be used to collect records, but they are not designed to hold large amounts of data.
It might work up to a certain point (~10MB) and fail beyond that.

How many unions do you plan to use in your program?



2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <[hidden email]>:


Re: Union/append performance question

Flavio Pompermaier
On the order of 10 GB.

On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <[hidden email]> wrote:

Re: Union/append performance question

Fabian Hueske-2
And how many unions would your program use if you followed the union-in-loop approach?

2015-09-07 18:31 GMT+02:00 Flavio Pompermaier <[hidden email]>:

Re: Union/append performance question

Flavio Pompermaier

3 or 4, usually.

On 7 Sep 2015 18:39, "Fabian Hueske" <[hidden email]> wrote:

Re: Union/append performance question

Fabian Hueske-2
In that case you should go with union.

2015-09-07 19:06 GMT+02:00 Flavio Pompermaier <[hidden email]>:
