Hello Flink team,
How can i partition and share static state among instances of a streaming operator? I have a huge list of keys and values, which are used to filter tuples in a stream. The list does not change. Currently i am sharing the list with each operator instance via the constructor, although only a subset of the list is required per operator (the assignment of subset to operator instance is known). I cannot use DataSet based functions in a streaming execution environment to assign sub lists. I also cannot use DataStream based partitioning functions as the list is static, i.e. not a DataStream. The dilemma exists as i am mixing static (DataSet type) content with streaming content. Is there any other approach aside from using an additional tool (e.g. distributed cache)? Thanks in advance. Regards Leon |
1)Create a Kafka Topic dedicated to your list of key/values. Inject your values into this topic, partitionned by the keys. So that you recover the keys in Flink.
2) Create a source for the stream of tuple your analysing -> output1 (Tuples).
3) Create a KafkaSource, and parse/recover your key value pairs from this source (e.g a first map operator) : map1 -> output 2 (K,V), then :
a) If you need all key/Value pairs at each operator : broadcast all partitions from the output 1 to the analysis operator
b) if you dont need all key/values pairs, just chain output1 to the analysis operator. Partitioning of K,V pairs will depend on Kafka partitioning strategy, and can be controlled in Flink anyway.
4) The analysis operator : will perform a RichCoFlatMapFunction, and can be Checkpointed. When receiving K,V pairs from output2, store them in a local state. When receiving tuple, should be able to to filter with the help of the local state, and propagate downstream or not.
> Message du 30/05/16 13:41 |
Dear Philippe,
that is exactly what i need. Thank you for the concise explanation. This approach is excellent, as it also permits the values to be easily updated externally. Kind regards Leon 30. May 2016 14:31 by [hidden email]:
|
In reply to this post by Philippe CAPARROY
Aljoscha is working to properly expose this in Flink. The design
document is here: https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#heading=h.pqg5z6g0mjm7 On Mon, May 30, 2016 at 2:31 PM, Philippe CAPARROY <[hidden email]> wrote: > > Just transform the list in a DataStream. A datastream can be finite. > > > One solution, in the context of a Streaming environment is to use Kafka, or > any other distributed broker, although Flink ships with a KafkaSource. > > > > 1)Create a Kafka Topic dedicated to your list of key/values. Inject your > values into this topic, partitionned by the keys. So that you recover the > keys in Flink. > > > > 2) Create a source for the stream of tuple your analysing -> output1 > (Tuples). > > > > 3) Create a KafkaSource, and parse/recover your key value pairs from this > source (e.g a first map operator) : map1 -> output 2 (K,V), then : > > > > > > > > a) If you need all key/Value pairs at each operator : > broadcast all partitions from the output 1 to the analysis operator > > > > b) if you dont need all key/values pairs, just chain > output1 to the analysis operator. Partitioning of K,V pairs will depend on > Kafka partitioning strategy, and can be controlled in Flink anyway. > > > > 4) The analysis operator : will perform a RichCoFlatMapFunction, and can be > Checkpointed. > > When receiving K,V pairs from output2, store them in a local state. > > When receiving tuple, should be able to to filter with the help of the local > state, and propagate downstream or not. > > > > > > > > > > > > > > > > > > > > > > > >> Message du 30/05/16 13:41 >> De : [hidden email] >> A : "User" <[hidden email]> >> Copie à : >> Objet : Elegantly sharing state in a streaming environment > >> >>Hello Flink team, > > How can i partition and share static state among instances of a streaming > operator? > > I have a huge list of keys and values, which are used to filter tuples in a > stream. The list does not change. Currently i am sharing the list with each > operator instance via the constructor, although only a subset of the list is > required per operator (the assignment of subset to operator instance is > known). I cannot use DataSet based functions in a streaming execution > environment to assign sub lists. I also cannot use DataStream based > partitioning functions as the list is static, i.e. not a DataStream. The > dilemma exists as i am mixing static (DataSet type) content with streaming > content. Is there any other approach aside from using an additional tool > (e.g. distributed cache)? > > Thanks in advance. > > Regards > Leon > > > |
Free forum by Nabble | Edit this page |