Object reuse in DataStreams

Object reuse in DataStreams

Urs Schoenenberger
Hi all,

we came across some interesting behaviour today.
We enabled object reuse on a streaming job that looks like this:

stream = env.addSource(source)
stream.map(mapFnA).addSink(sinkA)
stream.map(mapFnB).addSink(sinkB)

Operator chaining is enabled, so the optimizer fuses all operations into
a single slot.
The same object reference gets passed to both mapFnA and mapFnB. This
makes sense when I think about the internal implementation, but it still
came as a bit of a surprise since the object reuse docs (for batch -
there are no official ones for streaming, right?) don't really deal with
splitting the DataSet/DataStream. I guess my case is *technically*
covered by the documented warning that it is unsafe to reuse an object
that has already been collected; it's just that in this case the reuse is
"hidden" behind the stream definition DSL.

Is this the expected behaviour? Is object reuse for DataStreams
encouraged at all or is it more of a "hidden beta" feature until FLIP-21
is officially finished?

Best,
Urs

--
Urs Schönenberger - [hidden email]

TNG Technology Consulting GmbH, Beta-Straße 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München (local court) * HRB 135082

Re: Object reuse in DataStreams

vino yang
Hi Urs,

I don't think Flink encourages the use of the "object reuse" feature: the documentation warns that it can cause bugs when the user-code functions of an operation are not aware of this behavior [1].
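
One way for user code to be "aware" of the behavior is to copy a record before
mutating it. A sketch only, reusing the made-up Event POJO from the earlier
example (not Flink API):

import org.apache.flink.api.common.functions.MapFunction;

// Defensive copy: mutate a fresh object so the sibling branch (mapFnB) still
// sees the original instance even when object reuse is enabled.
public class CopyingMapFnA implements MapFunction<Event, Event> {
    @Override
    public Event map(Event in) {
        Event out = new Event(in.value);  // copy before mutating
        out.value *= 2;
        return out;
    }
}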

The "object reuse" is runtime behavior and it's configuration item belongs `ExecutionConfig` (this class both for batch and streaming), so it takes efforts for both batch and streaming[1].

"Object reuse" feature is not perfect, the main use case is for batch[2] and FLIP-21 tries to promote this feature gracefully, but the current state is "Under discussion".


Thanks, vino.
