Re: Sync two DataStreams

Posted by David Anderson-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Sync-two-DataStreams-tp34076p34081.html

There are a few ways to pre-ingest data from a side input before beginning to process another stream. One is to use the State Processor API [1] to create a savepoint that has the data from that side input in its state. For a simple example of bootstrapping state into a savepoint, see [2].

Another approach is to buffer the stream to be validated in Flink state until the side input has been fully ingested. Or run the job once with no event traffic and take a savepoint once the model has been broadcast.

Yet another solution might be to use a custom source that reads from one topic and then the other. See [3] and [4] for an example of that.

Other references on this topic include FLIP-17 [5] and Gregory Fee's talk on bootstrapping state [6].

Regards,
David

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html
[2] https://gist.github.com/alpinegizmo/ff3d2e748287853c88f21259830b29cf
[3] https://stackoverflow.com/a/48711260/2000823
[4] https://github.com/ScaleUnlimited/flink-streaming-kmeans/blob/master/src/main/java/com/scaleunlimited/flinksources/UnionedSources.java
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[6] https://www.youtube.com/watch?v=WdMcyN5QZZQ

On Fri, Apr 3, 2020 at 9:08 PM Georgi Stoyanov <[hidden email]> wrote:

>
> Hi,
>
> I want to implement a flow where the data from one stream is needed to
> validate data for second stream when the job is started without a
> savepoint or checkpoint.
>
> Both of them are reading from kafka. I want the data in the first one to
> be fully read and then to check the events from the second stream.
>
> Do you have any suggestions how this could be achieved (maybe without
> window or just put events in a state and start a timer)?
>
>
> Kind Regards
>
> G.S.