(DEPRECATED) Apache Flink User Mailing List archive.

Join of DataStream and DataSet


Reminia Scarlet

Join of DataStream and DataSet

Spark Streaming supports a direct join between a streaming DataFrame and a batch DataFrame, which makes it easy to implement an enrichment pipeline that joins a stream with a dimension table.

I checked the Flink docs, and it seems this feature is tracked in a Jira ticket that hasn't been resolved yet.

So how can I implement such a pipeline easily in Flink?

Hequn Cheng

Re: Join of DataStream and DataSet

Hi Reminia,

Currently, we can't join a DataStream with a DataSet in Flink. However, a DataSet is really a kind of bounded stream. From this point of view, you can use a streaming job to achieve your goal. The Flink Table API & SQL support different kinds of joins[1]. You can take a closer look at them. A regular join[2] is probably fine for you.
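To make the idea of a regular streaming join concrete, here is a minimal conceptual sketch in plain Python. It is not the Flink API; the class name `RegularJoin` and its methods are hypothetical and only illustrate how a symmetric hash join keeps rows from both inputs in state so that a row arriving later on either side can still find its matches.

```python
from collections import defaultdict


class RegularJoin:
    """Toy symmetric hash join (hypothetical, not a Flink class).

    Every row from either input is stored, because a future row on the
    other side might still match it.
    """

    def __init__(self):
        self.stream_state = defaultdict(list)  # key -> rows seen on the stream side
        self.table_state = defaultdict(list)   # key -> rows seen on the dimension side

    def on_stream(self, key, row):
        # Store the stream row, then join it with all dimension rows seen so far.
        self.stream_state[key].append(row)
        return [(row, t) for t in self.table_state[key]]

    def on_table(self, key, row):
        # Store the dimension row, then join it with all stream rows seen so far.
        self.table_state[key].append(row)
        return [(s, row) for s in self.stream_state[key]]

    def state_size(self):
        # Total number of rows held in state across both inputs.
        return (sum(len(v) for v in self.stream_state.values())
                + sum(len(v) for v in self.table_state.values()))


join = RegularJoin()
join.on_table("u1", {"country": "DE"})
matches = join.on_stream("u1", {"click": 1})
join.on_stream("u1", {"click": 2})
```

Note that `state_size()` grows with every stream row, which matters when the stream side is unbounded.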

Finally, I think you raised a very good point. It would be better if Flink supported this kind of join more directly and efficiently.

Best, Hequn

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/tableApi.html#joins

[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/joins.html#regular-joins

In reply to Reminia Scarlet's message of Thu, Apr 11, 2019 at 5:16 PM.

Fabian Hueske-2

Re: Join of DataStream and DataSet

Hi Reminia,

What Hequn said is correct.

However, I would *not* use a regular join, but rather model the problem as a time-versioned table join.

A regular join will materialize both inputs, which is probably not what you want to do for a stream.

For a time-versioned table join, only the time-versioned table would be stored (this should be your DataSet) and the stream is just streamed along.
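The difference in state can be sketched in plain Python. This is a conceptual illustration only, not the Flink API; `VersionedTableJoin`, `upsert`, and `enrich` are hypothetical names. Only the versions of the dimension table are stored, and each stream event is enriched with the version that was valid at its event time and then passed along without being kept.

```python
import bisect
from collections import defaultdict


class VersionedTableJoin:
    """Toy time-versioned table join (hypothetical, not a Flink class).

    State holds only the table versions; stream events are never stored.
    """

    def __init__(self):
        self.times = defaultdict(list)  # key -> sorted version start times
        self.rows = defaultdict(list)   # key -> rows aligned with self.times

    def upsert(self, key, valid_from, row):
        # Insert a new table version, keeping versions sorted by start time.
        i = bisect.bisect_right(self.times[key], valid_from)
        self.times[key].insert(i, valid_from)
        self.rows[key].insert(i, row)

    def enrich(self, key, event_time, event):
        # Find the latest version whose start time is <= event_time.
        times = self.times.get(key, [])
        i = bisect.bisect_right(times, event_time)
        if i == 0:
            return None  # no table version was valid at event_time
        return {**event, **self.rows[key][i - 1]}


table = VersionedTableJoin()
table.upsert("u1", 0, {"tier": "free"})
table.upsert("u1", 10, {"tier": "gold"})
early = table.enrich("u1", 5, {"click": 1})
late = table.enrich("u1", 12, {"click": 2})
```

Here `early` picks up the version valid from time 0 and `late` the one valid from time 10, while the state contains just the two table versions regardless of how many events are enriched.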

Best, Fabian

In reply to Hequn Cheng's message of Mon, Apr 15, 2019 at 04:02.