Sync two DataStreams

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Sync two DataStreams

Georgi Stoyanov
Hi,

I want to implement a flow where the data from one stream is needed to
validate data for second stream when the job is started without a
savepoint or checkpoint.

Both of them are reading from kafka. I want the data in the first one to
be fully read and then to check the events from the second stream.

Do you have any suggestions how this could be achieved (maybe without
window or just put events in a state and start a timer)?


Kind Regards

G.S.

Reply | Threaded
Open this post in threaded view
|

Re: Sync two DataStreams

David Anderson-2
There are a few ways to pre-ingest data from a side input before beginning to process another stream. One is to use the State Processor API [1] to create a savepoint that has the data from that side input in its state. For a simple example of bootstrapping state into a savepoint, see [2].

Another approach is to buffer the stream to be validated in Flink state until the side input has been fully ingested. Or run the job once with no event traffic and take a savepoint once the model has been broadcast.

Yet another solution might be to use a custom source that reads from one topic and then the other. See [3] and [4] for an example of that.

Other references on this topic include FLIP-17 [5] and Gregory Fee's talk on bootstrapping state [6].

Regards,
David

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html
[2] https://gist.github.com/alpinegizmo/ff3d2e748287853c88f21259830b29cf
[3] https://stackoverflow.com/a/48711260/2000823
[4] https://github.com/ScaleUnlimited/flink-streaming-kmeans/blob/master/src/main/java/com/scaleunlimited/flinksources/UnionedSources.java
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API

On Fri, Apr 3, 2020 at 9:08 PM Georgi Stoyanov <[hidden email]> wrote:

>
> Hi,
>
> I want to implement a flow where the data from one stream is needed to
> validate data for second stream when the job is started without a
> savepoint or checkpoint.
>
> Both of them are reading from kafka. I want the data in the first one to
> be fully read and then to check the events from the second stream.
>
> Do you have any suggestions how this could be achieved (maybe without
> window or just put events in a state and start a timer)?
>
>
> Kind Regards
>
> G.S.