Statefull computation

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Statefull computation

defstat
Hi. I am struggling the past few days to find a solution on the following problem, using Apache Flink:

I have a stream of vectors, represented by files in a local folder. After a new text file is located using DataStream<String> text = env.readFileStream(...), I transform (flatMap), the Input into a SingleOutputStreamOperator<Tuple2<String, Integer>, ?>, with the Integer being the score coming from a scoring function.

I want to persist a global HashMap containing the top-k vectors, using their scores. I approached the problem using a statefull transformation.
1. The first problem I have is that the HashMap retains per-sink data (so for each thread of workers, one HashMap of data). How can I make that a Global collection

2. Using Apache Spark, I made that possible by
JavaPairDStream<String, Integer> stateDstream = tuples.updateStateByKey(updateFunction);

and then making transformations on the stateDstream. Is there a way I can get the same functionality using FLink?

Thanks in advance!