Multiple keys in reduceGroup ?

Multiple keys in reduceGroup ?

LINZ, Arnaud

Hello,

 

While trying to understand why my code was giving strange results, I ended up adding “useless” checks to my code and came across what seems to me to be a bug. I group my dataset by a key, but in the reduceGroup function I am passed values with different keys.

 

My code has the following pattern (a mix of Java and pseudo-code in []):

inputDataSet [of InputRecord]
    .joinWithTiny(referencesDataSet [of Reference])
    .where([InputRecord SecondaryKeySelector]).equalTo([Reference KeySelector])
    .groupBy([PrimaryKeySelector : Tuple2<InputRecord, Reference> -> value.f0.getPrimaryKey()])
    .sortGroup([DateKeySelector], Order.ASCENDING)
    .reduceGroup(new GroupReduceFunction<Tuple2<InputRecord, Reference>, OutputRecord>() {
        @Override
        public void reduce(Iterable<Tuple2<InputRecord, Reference>> values, Collector<OutputRecord> out) throws Exception {
            // Issue: the values do not all share the same key
            final List<Tuple2<InputRecord, Reference>> listValues = new ArrayList<Tuple2<InputRecord, Reference>>();
            for (final Tuple2<InputRecord, Reference> value : values) { listValues.add(value); }

            final long primkey = listValues.get(0).f0.getPrimaryKey();
            for (int i = 1; i < listValues.size(); i++) {
                if (listValues.get(i).f0.getPrimaryKey() != primkey) {
                    // => This exception is thrown!
                    throw new IllegalStateException(primkey + " != " + listValues.get(i).f0.getPrimaryKey());
                }
            }
        }
    });

 

I use the current 0.10 snapshot. The issue appears in local cluster mode unit tests as well as in YARN mode (however, it works when I test with very few elements).

 

The sortGroup is not the cause of the problem, as I do get the same error without it.

 

Have I misunderstood the grouping concept or is it really an awful bug?

 

Best regards,

Arnaud

 

 

 




The integrity of this message cannot be guaranteed on the Internet. The company that sent this message cannot therefore be held liable for its content nor attachments. Any unauthorized use or dissemination is prohibited. If you are not the intended recipient of this message, then please delete it and notify the sender.

Re: Multiple keys in reduceGroup ?

Stephan Ewen
Hi!

You are checking for equality / inequality with "!=" - can you check with "equals()" ?

The key objects will most certainly be different in each record (as they are deserialized individually), but they should be equal.

Stephan


On Thu, Oct 22, 2015 at 12:20 PM, LINZ, Arnaud <[hidden email]> wrote:


Re: Multiple keys in reduceGroup ?

Aljoscha Krettek
Hi,
but he’s comparing it to a primitive long, so shouldn’t the Long key be unboxed and the comparison still be valid?
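
To illustrate the unboxing point, here is a minimal, self-contained Java sketch. The class and method names (KeyComparison, sameKeyBoxed, differsFromPrimitive) are made up for this example and are not part of the original program or of Flink. It assumes the key is a Long: comparing a boxed Long with a primitive long via != unboxes the Long and compares values, whereas != between two boxed Long objects compares references.

```java
class KeyComparison {

    /** Compares two boxed keys by value, which is what equals() does. */
    static boolean sameKeyBoxed(Long a, Long b) {
        return a.equals(b);
    }

    /** Boxed vs. primitive: the Long is unboxed, so != compares the numeric values. */
    static boolean differsFromPrimitive(Long boxed, long primitive) {
        return boxed != primitive;
    }

    public static void main(String[] args) {
        Long first = Long.valueOf(1000L);
        Long second = Long.valueOf(1000L);
        long primitive = 1000L;

        // Reference comparison: the two boxes are not guaranteed to be the same
        // object, so this may report "different" although the values match.
        System.out.println("boxed != boxed      : " + (first != second));
        // Value comparisons: both are independent of object identity.
        System.out.println("boxed.equals(boxed) : " + sameKeyBoxed(first, second));
        System.out.println("boxed != primitive  : " + differsFromPrimitive(first, primitive));
    }
}
```

So if getPrimaryKey() returns a Long and primkey is a long, the check in the original code is already a value comparison.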

My question is whether you enabled object-reuse-mode on the ExecutionEnvironment?

Cheers,
Aljoscha

> On 22 Oct 2015, at 12:31, Stephan Ewen <[hidden email]> wrote:

Re: Multiple keys in reduceGroup ?

Till Rohrmann
If not, could you provide us with the program and test data to reproduce the error?

Cheers,
Till

On Thu, Oct 22, 2015 at 12:34 PM, Aljoscha Krettek <[hidden email]> wrote:

RE: Multiple keys in reduceGroup ?

LINZ, Arnaud

Hi,

 

I was using primitive types, and object reuse (enableObjectReuse) was turned on. My next move was to turn it off, and that solved the problem.

It also increased execution time by 10%, but it’s hard to say whether this overhead is due to the copying or to the changed behavior of the reduceGroup algorithm once it gets the right data.

 

Since I never modify my objects, why isn’t object reuse working?

 

Best regards,

Arnaud

 

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, October 22, 2015 12:36
To: [hidden email]
Subject: Re: Multiple keys in reduceGroup ?

 


Re: Multiple keys in reduceGroup ?

Stephan Ewen
With object reuse activated, Flink heavily reuses objects. Each call to the Iterator in the reduceGroup function gives back one of the same two objects, which has been filled with different contents.

Your list of all values will effectively only contain two different objects.

Furthermore, the look-ahead, which determines that a new key starts, will also reuse one of these objects, which is why some elements in your list have their contents already overwritten with the look-ahead key.

The contract for object reuse mode is the following: An object is only valid until you request a new value from the iterator. After that, the object's contents may have changed due to reuse.

This effectively means accumulating objects in a list with object reuse mode requires you to manually copy them into the list.
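
The effect can be reproduced without Flink. The following is only a simplified simulation under stated assumptions, not Flink's actual iterator: Record, ReusingGroupIterator and collectByReference are hypothetical names, and the look-ahead is read eagerly inside next(), which may differ from the real implementation. It serves the first key group of a key-sorted input using just two Record instances.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ObjectReuseDemo {

    /** Mutable record standing in for the Tuple2<InputRecord, Reference> of the thread. */
    static final class Record {
        long key;
        int value;
    }

    /** Simplified stand-in for a reusing group iterator (assumes a non-empty input). */
    static final class ReusingGroupIterator implements Iterator<Record> {
        private final long[][] rows;   // rows of {key, value}, sorted by key
        private final long groupKey;   // key of the group being served
        private Record handedOut = new Record();
        private Record lookAhead = new Record();
        private int lookAheadIndex = 0;
        private boolean lookAheadValid;

        ReusingGroupIterator(long[][] rows) {
            this.rows = rows;
            this.groupKey = rows[0][0];
            this.lookAheadValid = fill(lookAhead, 0);
        }

        /** "Deserializes" the row at index into target; returns false at end of input. */
        private boolean fill(Record target, int index) {
            if (index >= rows.length) {
                return false;
            }
            target.key = rows[index][0];
            target.value = (int) rows[index][1];
            return true;
        }

        @Override
        public boolean hasNext() {
            return lookAheadValid && lookAhead.key == groupKey;
        }

        @Override
        public Record next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Record result = lookAhead;
            lookAhead = handedOut;   // the previously returned object is recycled ...
            handedOut = result;
            lookAheadValid = fill(lookAhead, ++lookAheadIndex); // ... and overwritten
            return result;
        }
    }

    /** Stores references without copying, like the reduce function in the thread. */
    static List<Record> collectByReference(long[][] rows) {
        List<Record> list = new ArrayList<>();
        Iterator<Record> it = new ReusingGroupIterator(rows);
        while (it.hasNext()) {
            list.add(it.next());
        }
        return list;
    }

    public static void main(String[] args) {
        long[][] rows = {{1, 10}, {1, 11}, {1, 12}, {2, 20}};
        for (Record r : collectByReference(rows)) {
            System.out.println("key=" + r.key + " value=" + r.value);
        }
    }
}
```

With this input the list holds three references to only two objects, and the middle entry already carries key 2 from the look-ahead, which is the kind of mismatch the IllegalStateException in the original code reports.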



On Thu, Oct 22, 2015 at 1:30 PM, LINZ, Arnaud <[hidden email]> wrote:


Re: Multiple keys in reduceGroup ?

Till Rohrmann
In reply to this post by LINZ, Arnaud

You don’t modify the objects, however, the ReusingKeyGroupedIterator, which is the iterator you have in your reduce function, does. Internally it uses two objects, in your case of type Tuple2<InputRecord, Reference>, to deserialize the input records. These two objects are alternately returned when you call next on the iterator. Since you only store references to these two objects in your ArrayList, you will see any changes made to these two objects.

However, this only explains why the values of your elements change, not why the key does. To understand why you observe different keys in your group, you have to know that the ReusingKeyGroupedIterator does a look-ahead to see whether the next element has the same key value. The look-ahead is stored in one of the two objects. When the iterator detects that the next element has a new key, it will finish the iteration. However, you will see the key value of the next group in half of your elements.

If you want to accumulate input data while using reuse object mode you should copy the input elements.
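
A minimal sketch of that copy-on-collect pattern, written in plain Java (8 or later) so that it runs without Flink. Everything named here (CopyOnCollect, Record, reusingSource, collectCopies) is hypothetical, and the reusing source is only a stand-in that refills a single instance. In a real job the copy would have to be deep enough to cover every mutable object the runtime may reuse, which presumably includes the records nested inside the Tuple2.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.UnaryOperator;

class CopyOnCollect {

    /** Mutable record with a copy constructor (stand-in for the real types). */
    static final class Record {
        long key;
        int value;

        Record(long key, int value) {
            this.key = key;
            this.value = value;
        }

        Record(Record other) {
            this(other.key, other.value);
        }
    }

    /** Mimics object reuse: every call to next() returns the same, refilled instance. */
    static Iterable<Record> reusingSource(final long[][] rows) {
        return () -> new Iterator<Record>() {
            private final Record reused = new Record(0L, 0);
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < rows.length;
            }

            @Override
            public Record next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                reused.key = rows[pos][0];
                reused.value = (int) rows[pos][1];
                pos++;
                return reused;
            }
        };
    }

    /** Accumulates elements, copying each one before the next element is requested. */
    static <T> List<T> collectCopies(Iterable<T> values, UnaryOperator<T> copier) {
        List<T> result = new ArrayList<>();
        for (T value : values) {
            result.add(copier.apply(value));
        }
        return result;
    }

    public static void main(String[] args) {
        long[][] rows = {{1, 10}, {1, 11}, {1, 12}};
        List<Record> copies = collectCopies(reusingSource(rows), Record::new);
        for (Record r : copies) {
            System.out.println("key=" + r.key + " value=" + r.value);
        }
    }
}
```

Because each element is copied while it is still valid, the list keeps the three original values even though the source hands out one and the same object.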



RE: Multiple keys in reduceGroup ?

LINZ, Arnaud

Hi,

 

Thanks a lot for the explanation. I cannot even say that it wasn’t stated in the documentation; I simply missed the iterator part:

 

“by default, user defined functions (like map() or reduce()) are getting new objects on each call (or through an iterator). So it is possible to keep references to the objects inside the function (for example in a List).

There is a switch at the ExecutionConfig which allows users to enable the object reuse mode:

env.getExecutionConfig().enableObjectReuse()

For mutable types, Flink will reuse object instances. In practice that means that a map() function will always receive the same object instance (with its fields set to new values). The object reuse mode will lead to better performance because fewer objects are created, but the user has to manually take care of what they are doing with the object references.”

Greetings,

Arnaud

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, October 22, 2015 13:45
To: [hidden email]
Subject: Re: Multiple keys in reduceGroup ?

 
