Apache Flink - Difference between operator and function

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Apache Flink - Difference between operator and function

M Singh
Hi:

I am reading the documentation on working with state (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html) and it states that :

All datastream functions can use managed state, but the raw state interfaces can only be used when implementing operators. Using managed state (rather than raw state) is recommended, since with managed state Flink is able to automatically redistribute state when the parallelism is changed, and also do better memory management.

I wanted to find out 
  1. What's the difference between an operator and a function ? 
  2. What are the raw state interfaces ? Are they checkpoint related interfaces ?

Thanks

Mans
Reply | Threaded
Open this post in threaded view
|

Re: Apache Flink - Difference between operator and function

Tzu-Li (Gordon) Tai
Hi Mans,

What's the difference between an operator and a function ? 

An operator in Flink needs to handle processing of watermarks, records, and checkpointing of the operator state.
To implement one, you need to extend the AbstractStreamOperator base class.
It is considered a very low-level API that normal users would not use unless they have very specific needs.
To add an operator to your pipeline, you would use DataStream::transform(…).

Functions are UDFs such as a FlatMapFunction, MapFunction, WindowFunction, etc., and is the typical way Flink users would define transformations on DataStreams / DataSets.
They can be added to your pipeline using specific transform methods for each kind of function, e.g. DataStream::flatMap(…) corresponds to the FlatMapFunction.
User functions are executed by an underlying operator (specifically, the AbstractStreamUdfOperator).
UDFs only expose the abstraction of per-record processing and producing outputs so you don’t have to worry about other complications, for example handling watermarks and checkpointing state.
Any registered state in UDFs are managed state, and will be checkpointed by the underlying operator.

What are the raw state interfaces ? Are they checkpoint related interfaces ?

The raw state interfaces refer to StateInitializationContext and StateSnapshotContext, which is only visible when you directly implement an AbstractStreamOperator.
Through those interfaces, you have additional access to raw operator and keyed state input / output streams on the initializeState and snapshotState methods, which lets you read / write state as a stream of raw bytes.

Hope this helps!

Cheers,
Gordon

On 20 December 2017 at 10:06:34 AM, M Singh ([hidden email]) wrote:

Hi:

I am reading the documentation on working with state (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html) and it states that :

All datastream functions can use managed state, but the raw state interfaces can only be used when implementing operators. Using managed state (rather than raw state) is recommended, since with managed state Flink is able to automatically redistribute state when the parallelism is changed, and also do better memory management.

I wanted to find out 
  1. What's the difference between an operator and a function ? 
  2. What are the raw state interfaces ? Are they checkpoint related interfaces ?

Thanks

Mans
Reply | Threaded
Open this post in threaded view
|

Re: Apache Flink - Difference between operator and function

M Singh
Thanks Gordon for your explanation.  

Mans


On Wednesday, December 20, 2017 2:16 PM, Tzu-Li (Gordon) Tai <[hidden email]> wrote:


Hi Mans,

What's the difference between an operator and a function ? 

An operator in Flink needs to handle processing of watermarks, records, and checkpointing of the operator state.
To implement one, you need to extend the AbstractStreamOperator base class.
It is considered a very low-level API that normal users would not use unless they have very specific needs.
To add an operator to your pipeline, you would use DataStream::transform(…).

Functions are UDFs such as a FlatMapFunction, MapFunction, WindowFunction, etc., and is the typical way Flink users would define transformations on DataStreams / DataSets.
They can be added to your pipeline using specific transform methods for each kind of function, e.g. DataStream::flatMap(…) corresponds to the FlatMapFunction.
User functions are executed by an underlying operator (specifically, the AbstractStreamUdfOperator).
UDFs only expose the abstraction of per-record processing and producing outputs so you don’t have to worry about other complications, for example handling watermarks and checkpointing state.
Any registered state in UDFs are managed state, and will be checkpointed by the underlying operator.

What are the raw state interfaces ? Are they checkpoint related interfaces ?

The raw state interfaces refer to StateInitializationContext and StateSnapshotContext, which is only visible when you directly implement an AbstractStreamOperator.
Through those interfaces, you have additional access to raw operator and keyed state input / output streams on the initializeState and snapshotState methods, which lets you read / write state as a stream of raw bytes.

Hope this helps!

Cheers,
Gordon

On 20 December 2017 at 10:06:34 AM, M Singh ([hidden email]) wrote:
Hi:

I am reading the documentation on working with state (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html) and it states that :

All datastream functions can use managed state, but the raw state interfaces can only be used when implementing operators. Using managed state (rather than raw state) is recommended, since with managed state Flink is able to automatically redistribute state when the parallelism is changed, and also do better memory management.

I wanted to find out 
  1. What's the difference between an operator and a function ? 
  2. What are the raw state interfaces ? Are they checkpoint related interfaces ?

Thanks

Mans