Hello
I have a question about programming user-defined functions: is it still the case, as in the old Stratosphere days, that object creation should be avoided at all costs? I ask because some of the examples now create Tuples and other objects before returning them. I am going to have a streaming plan with at least 6 steps, and I am going to use POJOs. Is it a big performance improvement to define one big POJO that all the steps can use, or is it better to have smaller ones that send less data but create more objects? Thanks, Michael |
Hello Michael,
whenever you write a Java program, you should avoid unnecessary object creation if you want it to be efficient, because every created object has to be garbage collected later (which slows down your program). You can have small POJOs; just try to avoid calling "new" inside your functions.

Instead of:

    class Mapper implements MapFunction<String, Pojo> {
        public Pojo map(String s) {
            Pojo p = new Pojo();
            p.f = s;
            return p;
        }
    }

do:

    class Mapper implements MapFunction<String, Pojo> {
        private Pojo p = new Pojo();
        public Pojo map(String s) {
            p.f = s;
            return p;
        }
    }

Then an object is only created once per Mapper and not once per record.

Hope this helps.

Regards,
Timo

On 12.08.2015 11:53, Michael Huelfenhaus wrote:
|
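To make the difference between the two Mapper variants concrete, here is a minimal sketch that counts constructor calls. It assumes a local `MapFunction` interface as a stand-in for Flink's (so it compiles without Flink on the classpath), and the `Pojo` with an instance counter, the class names and `countAllocations` are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class ReuseSketch {
    /** Hypothetical Pojo; the static counter exists only for this demo. */
    public static class Pojo {
        static int created = 0;
        public String f;
        public Pojo() { created++; }
    }

    /** Local stand-in for Flink's MapFunction (assumption: same single-method shape). */
    public interface MapFunction<I, O> {
        O map(I value);
    }

    /** Allocates a new Pojo for every record. */
    public static class AllocatingMapper implements MapFunction<String, Pojo> {
        public Pojo map(String s) {
            Pojo p = new Pojo();
            p.f = s;
            return p;
        }
    }

    /** Allocates one Pojo per mapper instance and overwrites it for every record. */
    public static class ReusingMapper implements MapFunction<String, Pojo> {
        private final Pojo p = new Pojo();
        public Pojo map(String s) {
            p.f = s;
            return p;
        }
    }

    /** Creates a mapper, feeds it all records, and returns the number of Pojos constructed. */
    public static int countAllocations(Supplier<MapFunction<String, Pojo>> factory,
                                       List<String> records) {
        Pojo.created = 0;
        MapFunction<String, Pojo> mapper = factory.get();
        for (String record : records) {
            mapper.map(record);
        }
        return Pojo.created;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d");
        System.out.println("per record: " + countAllocations(AllocatingMapper::new, records)); // per record: 4
        System.out.println("reused:     " + countAllocations(ReusingMapper::new, records));    // reused:     1
    }
}
```

The reusing variant returns the same instance every time, so it is only safe as long as nothing downstream holds on to a previously returned object and expects its old value.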
Hey Timo,
yes, that is what I needed to know. Thanks - Michael On 12.08.2015 at 12:44, Timo Walther <[hidden email]> wrote: |
Thanks, Timo. That is a good interview question. Best regards, Hawin On Thu, Aug 13, 2015 at 1:11 AM, Michael Huelfenhaus <[hidden email]> wrote: |
In reply to this post by Timo Walther
Any insight into these 2 questions? On 12 Aug 2015 17:38, "Flavio Pompermaier" <[hidden email]> wrote:
|
I think Timo answered both questions (quoting Michael: "Hey Timo, yes that is what I needed to know. Thanks"). 2015-08-14 17:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
|
In reply to this post by Flavio Pompermaier
Hi!

(1) A mapper is created once per parallel task. So if you create a program that runs a map() transformation with a parallelism of n, you will have n mapper instances in the cluster. Some may be on the same TaskManager, if the TaskManager has multiple slots.

(2) I would really like that. But it means Java has to deal with both managed and unmanaged memory at the same time, which is quite a heavy addition. C# has some form of support for that.

BTW: Where did you originally post these questions? I have not seen them before...

On Fri, Aug 14, 2015 at 5:43 PM, Flavio Pompermaier <[hidden email]> wrote:
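Point (1) can be pictured with a small plain-Java simulation. This is only a sketch, not Flink code: the parallel tasks are simulated by looping over partitions sequentially, and `MapFunction`, `NumberingMapper` and `runParallel` are hypothetical names. It shows that each simulated task gets its own mapper instance, so instance fields (such as a reused POJO) are per task, not shared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class ParallelMapperSketch {
    /** Local stand-in for Flink's MapFunction (assumption, so the sketch runs without Flink). */
    public interface MapFunction<I, O> {
        O map(I value);
    }

    /** Hypothetical mapper with per-instance state: numbers the records it has seen. */
    public static class NumberingMapper implements MapFunction<String, String> {
        static int instances = 0;   // demo-only counter of constructed mappers
        private int seen = 0;       // state private to this mapper instance
        public NumberingMapper() { instances++; }
        public String map(String s) {
            seen++;
            return s + "#" + seen;
        }
    }

    /** Simulates map() with parallelism n = partitions.size(): one mapper instance per partition. */
    public static List<List<String>> runParallel(Supplier<MapFunction<String, String>> factory,
                                                 List<List<String>> partitions) {
        List<List<String>> results = new ArrayList<>();
        for (List<String> partition : partitions) {
            MapFunction<String, String> mapper = factory.get(); // one instance per simulated task
            List<String> out = new ArrayList<>();
            for (String record : partition) {
                out.add(mapper.map(record));
            }
            results.add(out);
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        List<List<String>> results = runParallel(NumberingMapper::new, partitions);
        System.out.println(NumberingMapper.instances); // 3
        System.out.println(results); // [[a#1, b#2], [c#1], [d#1, e#2, f#3]]
    }
}
```

Because the numbering restarts at 1 in every partition, the output makes it visible that no state leaks from one simulated task to another.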
|
In reply to this post by Flavio Pompermaier
Oh sorry, Flavio! I didn't see Hawin's questions :-( 2015-08-14 17:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
|
In reply to this post by Stephan Ewen
Hi Stephan, thanks for the reply! I was convinced I had posted those questions in this thread as the 3rd or 4th message. Didn't I? On 14 Aug 2015 17:57, "Stephan Ewen" <[hidden email]> wrote:
|
Yes, map() is like a convenience function around mapPartition(). On Fri, Aug 14, 2015 at 6:09 PM, Flavio Pompermaier <[hidden email]> wrote:
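One way to read "map() is like a convenience function around mapPartition()" is the adapter below. It is a sketch of the idea only, not how Flink implements map() internally: the two interfaces are local stand-ins loosely modelled on Flink's `MapFunction` and `MapPartitionFunction`, with `java.util.function.Consumer` standing in for Flink's `Collector`, and `asMapPartition` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class MapViaMapPartitionSketch {
    /** Local stand-in: called once per record. */
    public interface MapFunction<I, O> {
        O map(I value);
    }

    /** Local stand-in: called once per partition with all of its records. */
    public interface MapPartitionFunction<I, O> {
        void mapPartition(Iterable<I> values, Consumer<O> out);
    }

    /** Wraps a per-record function into a per-partition function that loops over the records. */
    public static <I, O> MapPartitionFunction<I, O> asMapPartition(MapFunction<I, O> fn) {
        return (values, out) -> {
            for (I value : values) {
                out.accept(fn.map(value));
            }
        };
    }

    public static void main(String[] args) {
        MapPartitionFunction<String, Integer> lengths = asMapPartition(String::length);
        List<Integer> collected = new ArrayList<>();
        lengths.mapPartition(Arrays.asList("flink", "map", "partition"), collected::add);
        System.out.println(collected); // [5, 3, 9]
    }
}
```

Seen this way, a per-partition function is the more general form: it can set up state once for the whole partition, while the per-record form is the shorter one to write when no such setup is needed.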
|