Calculating over multiple streams...

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Calculating over multiple streams...

Oytun Tez
Hi everyone!

I've been struggling with an implementation problem in the last days, which I am almost sure caused by my misunderstanding of Flink. 

The purpose: consume multiple streams, update a score list (with meta data e.g. user_id) for each update coming from any of the streams. The new output list will also need to be used by another pattern.
  1. We created 3 SourceFunctions, that periodically go to our MySQL database and stream new results back. This one returns POJOs.
  2. Then we flatMap these streams to unify their Type. They are now all Tuple3s with matching types.
  3. And we process each stream with the same ProcessFunction.
  4. I am stuck with the output list.
Business case (human translation workflow):
  1. Input: Stream "translation quality" score updates of each translator [translator_id, score]
  2. Input: Stream "responsivity score" updates of each translator (email open rates/speeds etc) [translator_id, score]
  3. Input: Stream "number of projects" updates each translator worked on [translator_id, score]
  4. Calculation: for each translator, use 3 scores to come up with a unified score and its percentile over all translators. This step definitely feels like a Batch job, but I am pushing to go with a streaming mindset.
  5. So now supposedly, in this way or another, I have a list of translators with their unified score and percentile over this list.
  6. Another independent stream should send me updates on "need for proofreaders" – I couldn't even come to this point yet. Once a need info is streamed, application would fetch the previously calculated list and let's say picks the top X determined by the message from need algorithm.

image.png

Overall, my desire is to make everything a stream and let the data and decisions constantly react to stream updates. I am very confused at this point. Tried using keyed and operator states, but they seem to be keeping their state only for their own items. Considering to do Batch instead after all the struggle.

Any ideas? I can even get on a call.














---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.
Reply | Threaded
Open this post in threaded view
|

Re: Calculating over multiple streams...

Michael Latta
You may want to union the 3 streams prior to the process function if they are independently processed. 


Michael

On Feb 22, 2019, at 9:15 AM, Oytun Tez <[hidden email]> wrote:

Hi everyone!

I've been struggling with an implementation problem in the last days, which I am almost sure caused by my misunderstanding of Flink. 

The purpose: consume multiple streams, update a score list (with meta data e.g. user_id) for each update coming from any of the streams. The new output list will also need to be used by another pattern.
  1. We created 3 SourceFunctions, that periodically go to our MySQL database and stream new results back. This one returns POJOs.
  2. Then we flatMap these streams to unify their Type. They are now all Tuple3s with matching types.
  3. And we process each stream with the same ProcessFunction.
  4. I am stuck with the output list.
Business case (human translation workflow):
  1. Input: Stream "translation quality" score updates of each translator [translator_id, score]
  2. Input: Stream "responsivity score" updates of each translator (email open rates/speeds etc) [translator_id, score]
  3. Input: Stream "number of projects" updates each translator worked on [translator_id, score]
  4. Calculation: for each translator, use 3 scores to come up with a unified score and its percentile over all translators. This step definitely feels like a Batch job, but I am pushing to go with a streaming mindset.
  5. So now supposedly, in this way or another, I have a list of translators with their unified score and percentile over this list.
  6. Another independent stream should send me updates on "need for proofreaders" – I couldn't even come to this point yet. Once a need info is streamed, application would fetch the previously calculated list and let's say picks the top X determined by the message from need algorithm.

<image.png>

Overall, my desire is to make everything a stream and let the data and decisions constantly react to stream updates. I am very confused at this point. Tried using keyed and operator states, but they seem to be keeping their state only for their own items. Considering to do Batch instead after all the struggle.

Any ideas? I can even get on a call.














---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.
Reply | Threaded
Open this post in threaded view
|

Re: Calculating over multiple streams...

Oytun Tez
Restructuring with your tip now, Michael, thank you!

---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Fri, Feb 22, 2019 at 11:23 AM Michael Latta <[hidden email]> wrote:
You may want to union the 3 streams prior to the process function if they are independently processed. 


Michael

On Feb 22, 2019, at 9:15 AM, Oytun Tez <[hidden email]> wrote:

Hi everyone!

I've been struggling with an implementation problem in the last days, which I am almost sure caused by my misunderstanding of Flink. 

The purpose: consume multiple streams, update a score list (with meta data e.g. user_id) for each update coming from any of the streams. The new output list will also need to be used by another pattern.
  1. We created 3 SourceFunctions, that periodically go to our MySQL database and stream new results back. This one returns POJOs.
  2. Then we flatMap these streams to unify their Type. They are now all Tuple3s with matching types.
  3. And we process each stream with the same ProcessFunction.
  4. I am stuck with the output list.
Business case (human translation workflow):
  1. Input: Stream "translation quality" score updates of each translator [translator_id, score]
  2. Input: Stream "responsivity score" updates of each translator (email open rates/speeds etc) [translator_id, score]
  3. Input: Stream "number of projects" updates each translator worked on [translator_id, score]
  4. Calculation: for each translator, use 3 scores to come up with a unified score and its percentile over all translators. This step definitely feels like a Batch job, but I am pushing to go with a streaming mindset.
  5. So now supposedly, in this way or another, I have a list of translators with their unified score and percentile over this list.
  6. Another independent stream should send me updates on "need for proofreaders" – I couldn't even come to this point yet. Once a need info is streamed, application would fetch the previously calculated list and let's say picks the top X determined by the message from need algorithm.

<image.png>

Overall, my desire is to make everything a stream and let the data and decisions constantly react to stream updates. I am very confused at this point. Tried using keyed and operator states, but they seem to be keeping their state only for their own items. Considering to do Batch instead after all the struggle.

Any ideas? I can even get on a call.














---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.