(DEPRECATED) Apache Flink User Mailing List archive.

Union/append performance question

Flavio Pompermaier

Union/append performance question

Hi to all,

I have a job where I have to incrementally add Tuples to a dataset (in a while loop).

Is union() the best operator for this task, or is there a more performant one?

Does union affect how the already existing elements are read, or does it just append the new ones somewhere?

Best,

Flavio

Stephan Ewen

Re: Union/append performance question

Union, like all operators, is lazy. Calling union() only builds a "union stream", which performs the union when the job is executed. So nothing is added before you call "env.execute()".

If you call "env.execute()" and then union again, the entire history of the computation is re-executed to recompute the data set that you union with. Hence, for incremental computations, union() is probably not a good choice unless you persist intermediate data (seamless support for that is WIP).

Stephan
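
The laziness described above can be illustrated with a toy model in plain Java. This is only a sketch and not the Flink API: the class `LazyUnionDemo`, its `source()` and `union()` helpers, and the `sourceReads` counter are hypothetical names made up for illustration. A "data set" is modelled as a `Supplier` that is evaluated only on `get()`, which stands in for "env.execute()".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model of lazy data sets. NOT the Flink API; all names are hypothetical.
final class LazyUnionDemo {
    // Counts how many times the source is actually read.
    static int sourceReads = 0;

    // A "data set" is only a recipe; nothing runs until get() is called.
    static Supplier<List<Integer>> source() {
        return () -> {
            sourceReads++;
            return List.of(1, 2, 3);
        };
    }

    // union() wraps two recipes in a new recipe; no data is touched yet.
    static Supplier<List<Integer>> union(Supplier<List<Integer>> a,
                                         Supplier<List<Integer>> b) {
        return () -> {
            List<Integer> out = new ArrayList<>(a.get());
            out.addAll(b.get());
            return out;
        };
    }

    public static void main(String[] args) {
        Supplier<List<Integer>> first = union(source(), () -> List.of(4));
        System.out.println(sourceReads);   // 0: building the plan reads nothing
        System.out.println(first.get());   // [1, 2, 3, 4]

        // Union again after the first "execution": the whole history is recomputed.
        Supplier<List<Integer>> second = union(first, () -> List.of(5));
        System.out.println(second.get());  // [1, 2, 3, 4, 5]
        System.out.println(sourceReads);   // 2: the source was read a second time
    }
}
```

In this model the second evaluation reads the source again because nothing from the first evaluation was kept, which mirrors the point about persisting intermediate data.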

In reply to Flavio Pompermaier (Mon, Sep 7, 2015 at 2:56 PM).

Flavio Pompermaier

Re: Union/append performance question

Hi Stephan,

thanks for the answer. Unfortunately, I didn't understand whether there's an alternative to union right now.

My process is basically like this:

DataSet x = ...;
while (loopCnt < 3) {
    // join function elided in the original message
    x = x.join(y).where(0).equalTo(0).with(...);
    accumulated = x.filter(t -> t.f1 == 0);
    x = x.filter(t -> t.f1 != 0);
    loopCnt++;
}

Best,

Flavio

In reply to Stephan Ewen (Mon, Sep 7, 2015 at 3:15 PM).

Fabian Hueske-2

Re: Union/append performance question

Hi Flavio,

Your example does not contain a union.

Union itself basically comes for free. However, if you have a lot of small DataSets that you want to union, the plan can become very complex and might cause overhead due to scheduling many small tasks. For example, it is usually better to have one data source and input format that reads multiple small files than to add one data source for each tiny file and apply union to all of the data sources to get all the data.

TL;DR: if your iteration count is only 3, as your example suggests, you should be fine. If it exceeds, say, 32, it might be worth thinking about your program.

Cheers, Fabian
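
How the plan grows with the loop count can be sketched with a toy plan graph in plain Java. This is not the Flink API or its optimizer: `UnionPlanGrowth`, `Node`, `buildPlan` and `countOperators` are hypothetical names, and the sketch assumes that each loop pass adds one join, two filters and one union into an accumulated data set that starts from its own source.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy plan graph. NOT the Flink API; all names are hypothetical.
final class UnionPlanGrowth {
    // A plan node: an operator label plus the nodes it reads from.
    static final class Node {
        final String op;
        final List<Node> inputs;

        Node(String op, Node... inputs) {
            this.op = op;
            this.inputs = List.of(inputs);
        }
    }

    // Builds the plan for a loop that unions one filtered result per pass.
    static Node buildPlan(int iterations) {
        Node y = new Node("source y");
        Node x = new Node("source x");
        Node accumulated = new Node("source accumulated"); // assumed starting data set
        for (int i = 0; i < iterations; i++) {
            Node joined = new Node("join", x, y);
            accumulated = new Node("union", accumulated,
                                   new Node("filter f1 == 0", joined));
            x = new Node("filter f1 != 0", joined);
        }
        return accumulated;
    }

    // Counts the distinct operators reachable from the given sink.
    static int countOperators(Node sink) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        collect(sink, seen);
        return seen.size();
    }

    private static void collect(Node node, Set<Node> seen) {
        if (seen.add(node)) {
            for (Node input : node.inputs) {
                collect(input, seen);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(countOperators(buildPlan(3)));  // 14
        System.out.println(countOperators(buildPlan(32))); // 130
    }
}
```

Under these assumptions the plan gains four operators per pass, so a handful of iterations stays small while a few dozen already produce a plan with over a hundred operators to schedule.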

In reply to Flavio Pompermaier (2015-09-07 16:29 GMT+02:00).

Flavio Pompermaier

Re: Union/append performance question

Sorry, the program does have a union: accumulated = accumulated.union(x.filter(t.f1 == 0))

In reply to Fabian Hueske (Mon, Sep 7, 2015 at 4:58 PM).

Fabian Hueske-2

Re: Union/append performance question

If the loop count of 3 is fixed (or not significantly larger), union should be fine.

In reply to Flavio Pompermaier (2015-09-07 17:07 GMT+02:00).

Flavio Pompermaier

Re: Union/append performance question

OK, thanks. Are there any alternatives to that? May I use accumulators for that?

In reply to Fabian Hueske (7 Sep 2015 17:47).

Fabian Hueske-2

Re: Union/append performance question

Accumulators can be used to collect records, but they are not designed to hold large amounts of data.
It might work up to a certain point (~10MB) and fail beyond that.

How many unions do you plan to use in your program?

In reply to Flavio Pompermaier (2015-09-07 17:58 GMT+02:00).

Flavio Pompermaier

Re: Union/append performance question

On the order of 10 GB.

In reply to Fabian Hueske (Mon, Sep 7, 2015 at 6:14 PM).

Fabian Hueske-2

Re: Union/append performance question

And how many unions would your program use if you followed the union-in-loop approach?

In reply to Flavio Pompermaier (2015-09-07 18:31 GMT+02:00).

Flavio Pompermaier

Re: Union/append performance question

3 or 4, usually.

In reply to Fabian Hueske (7 Sep 2015 18:39).

Fabian Hueske-2

Re: Union/append performance question

In that case you should go with union.

In reply to Flavio Pompermaier (2015-09-07 19:06 GMT+02:00).