RE: Left join with unbalanced dataset

Posted by LINZ, Arnaud on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Left-join-with-unbalanced-dataset-tp4572p4576.html

Hi,

Thanks, I can’t believe I missed the outer join operators… Will try them and will keep you informed.

I use the “official” 0.10 release from the Maven repository. The off-heap memory I use is what HDFS I/O uses (codecs, DFSOutputStream threads…), but I don’t have many files open at once, and doubling the amount of memory did not solve the problem.

Arnaud

 

 

From: [hidden email] [mailto:[hidden email]] On behalf of Stephan Ewen
Sent: Sunday, January 31, 2016 20:57
To: [hidden email]
Subject: Re: Left join with unbalanced dataset

 

Hi!

 

YARN killing the application seems strange. The memory use that YARN sees should not change even when one node gets a lot of data.

 

Can you share what version of Flink (plus commit hash) you are using and whether you use off-heap memory or not?

 

Thanks,

Stephan

 

 

On Sun, Jan 31, 2016 at 10:47 AM, Till Rohrmann <[hidden email]> wrote:

Hi Arnaud,

 

the unmatched elements of A will only end up on the same worker node if they all share the same key. Otherwise, they will be spread evenly across your cluster. However, I would also recommend using Flink's leftOuterJoin.
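
To illustrate the point about keys, here is a minimal plain-Java sketch with no Flink dependency. It assumes records are assigned to workers by the hash of their key modulo the number of workers, which is a simplification and not Flink's actual partitioner; the names KeySpreadSketch, workerFor and load are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class KeySpreadSketch {
    // Hypothetical stand-in for hash partitioning: pick a worker from the key's hash.
    static int workerFor(Object key, int workers) {
        return Math.floorMod(key.hashCode(), workers);
    }

    // Counts how many records each worker would receive for the given keys.
    static Map<Integer, Integer> load(int[] keys, int workers) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int k : keys) {
            counts.merge(workerFor(k, workers), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int workers = 4;
        int[] distinctKeys = new int[1000];
        int[] singleKey = new int[1000];
        for (int i = 0; i < 1000; i++) {
            distinctKeys[i] = i;   // every record has its own key
            singleKey[i] = 42;     // every record shares one key
        }
        System.out.println("distinct keys: " + load(distinctKeys, workers));
        System.out.println("single key:    " + load(singleKey, workers));
    }
}
```

With distinct keys the records are divided among all workers; only when every record carries the same key do they pile up on a single worker.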

 

Cheers,

Till

 

On Sun, Jan 31, 2016 at 5:27 AM, Chiwan Park <[hidden email]> wrote:

Hi Arnaud,

To join two datasets, the community recommends using the join operation rather than the cogroup operation. For a left join, you can use the leftOuterJoin method. Flink’s optimizer chooses the distributed join execution strategy based on statistics about the datasets, such as their size. Additionally, you can set a join hint to help the optimizer decide on the strategy.

The Transformations section [1] of the Flink documentation describes the outer join operations in detail.
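
As a rough illustration of what a left outer join computes, here is a plain-Java sketch with no Flink dependency. It assumes each key occurs at most once per dataset, and the names LeftOuterJoinSketch and leftOuterJoin are hypothetical. In Flink's DataSet API the same thing would be expressed along the lines of a.leftOuterJoin(b).where(...).equalTo(...).with(...), where the join function receives null for the right-hand side of an unmatched record; the documentation in [1] is the authoritative description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LeftOuterJoinSketch {
    // Emits one output row per record of A; the B side is null when no key matches.
    static List<String> leftOuterJoin(Map<Integer, String> a, Map<Integer, String> b) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> left : a.entrySet()) {
            String right = b.get(left.getKey()); // null for unmatched keys
            out.add(left.getKey() + "," + left.getValue() + "," + right);
        }
        return out;
    }

    public static void main(String[] args) {
        // A is twice the size of B, so half of A's records have no match.
        Map<Integer, String> a = new TreeMap<>();
        Map<Integer, String> b = new HashMap<>();
        for (int i = 0; i < 6; i++) {
            a.put(i, "a" + i);
            if (i % 2 == 0) {
                b.put(i, "b" + i);
            }
        }
        for (String row : leftOuterJoin(a, b)) {
            System.out.println(row);
        }
    }
}
```

Every record of A appears exactly once in the result, whether or not it has a match, so the unmatched half does not need to be gathered in one place.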

I hope this helps.

[1]: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations

Regards,
Chiwan Park

> On Jan 30, 2016, at 6:43 PM, LINZ, Arnaud <[hidden email]> wrote:
>
> Hello,
>
> I have a very big dataset A to left join with a dataset B that is half its size. That is to say, half of A's records will be matched with one record of B, and the other half with null values.
>
> I used a CoGroup for that, but my batch fails because YARN kills the container due to memory problems.
>
> I guess that’s because one worker will get half of the A dataset (the unmatched records), and that’s too much for a single JVM.
>
> Am I right in my diagnosis? Is there a better way to left join unbalanced datasets?
>
> Best regards,
>
> Arnaud
>
>
>

> The integrity of this message cannot be guaranteed on the Internet. The company that sent this message cannot therefore be held liable for its content nor attachments. Any unauthorized use or dissemination is prohibited. If you are not the intended recipient of this message, then please delete it and notify the sender.