Hello,

I have a fairly typical streaming use case, but I'm not able to figure out how best to implement it in Flink. I want to join records read from a Kafka stream with one (or more) dimension tables which are saved as flat files. As per this JIRA, it's not possible to join a DataStream with a DataSet. These tables are too big to do a collect() and join.

It would be good to read these files during startup, do a partitionByHash, and keep them cached. On the DataStream, maybe do a keyBy and join. Is something like this possible?

Srikanth
Hi Srikanth,
that's an interesting use case. It's not possible to do something like this out of the box, but I'm actually working on an API for such cases. In the meantime, I programmed a short example that shows how something like this can be done using the API that is currently available. It requires writing a custom operator, but it is still somewhat succinct:

Please let me know if you have any questions.

Cheers,
Aljoscha

On Thu, 21 Apr 2016 at 03:06 Srikanth <[hidden email]> wrote:
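A minimal sketch of the general idea, using connected keyed streams and a RichCoFlatMapFunction rather than the hand-written operator Aljoscha refers to. The record types, field names, and in-memory sources are illustrative assumptions, and state lives in plain maps with no checkpointing, so this is a simplification rather than a production pattern:

import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable

// Hypothetical record types, for illustration only.
case class Event(adId: String, clicks: Long)
case class DimRecord(adId: String, campaign: String)

// Joins the event stream against dimension records as they arrive.
// Events that show up before their dimension record are buffered and
// flushed once the matching record has been read.
class StaticJoin extends RichCoFlatMapFunction[Event, DimRecord, (Event, DimRecord)] {
  private val dim = mutable.Map.empty[String, DimRecord]
  private val waiting = mutable.Map.empty[String, mutable.ListBuffer[Event]]

  override def flatMap1(e: Event, out: Collector[(Event, DimRecord)]): Unit =
    dim.get(e.adId) match {
      case Some(d) => out.collect((e, d))
      case None    => waiting.getOrElseUpdate(e.adId, mutable.ListBuffer.empty) += e
    }

  override def flatMap2(d: DimRecord, out: Collector[(Event, DimRecord)]): Unit = {
    dim(d.adId) = d
    // Flush any events that were waiting for this key.
    waiting.remove(d.adId).foreach(_.foreach(e => out.collect((e, d))))
  }
}

object StreamStaticJoinSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-ins for the real inputs: events would come from a Kafka source,
    // dimension records from readTextFile over the flat files.
    val events = env.fromElements(Event("a1", 3L), Event("a2", 5L))
    val dimension = env.fromElements(DimRecord("a1", "spring-sale"), DimRecord("a2", "launch"))

    val joined = events
      .connect(dimension)
      .keyBy(_.adId, _.adId) // partition both inputs by the join key
      .flatMap(new StaticJoin)

    joined.print()
    env.execute("stream/static join sketch")
  }
}

Keying both inputs by the join key ensures that an event and the dimension record it matches always land on the same parallel instance.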
Hi, Lohith. Sent from my Sony Xperia™ smartphone
Aljoscha,

Looks like a potential solution, though it feels a bit hacky.

I didn't quite understand why a list-backed store is used for the static input buffer. An inner join should emit only one record if there is a key match.

Is it a property of the system to emit a Long.MAX_VALUE watermark when a finite stream source ends? If so, can I do something like this to read the static file in parallel?

val meta = env.readTextFile("s3://path/to/file").map(...).keyBy(...)

Shouldn't we also override the checkpoint handling of the custom operator? If so, should checkpoints wait/fail during the initial read phase?

Lohith, adding a component like Cassandra just for this feels like overkill. But if I can't find a suitable way to do this, I might use it (or probably Redis).

Srikanth

On Fri, Apr 22, 2016 at 12:20 PM, Lohith Samaga M <[hidden email]> wrote:
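On the parallel read, a sketch of what that one-liner could expand to; the path, the CSV layout, and DimRecord are placeholders. readTextFile on the streaming environment gives a finite DataStream that is consumed with the parallelism of the job, and when the source finishes, Flink sends a final Long.MAX_VALUE watermark downstream, which is what a custom operator can key off to know the static side is complete:

import org.apache.flink.streaming.api.scala._

// Placeholder record type, assuming a CSV layout of "adId,campaign".
case class DimRecord(adId: String, campaign: String)

object StaticReadSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Finite source, read in parallel; once the whole file has been
    // consumed, a final Long.MAX_VALUE watermark is emitted downstream.
    val meta = env
      .readTextFile("s3://bucket/path/to/file") // placeholder path
      .map { line =>
        val Array(id, campaign) = line.split(",") // assumed two-column layout
        DimRecord(id, campaign)
      }
      .keyBy(_.adId)

    meta.print()
    env.execute("static read sketch")
  }
}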
Aljoscha,

Your thoughts on this?

Srikanth

On Mon, Apr 25, 2016 at 8:08 PM, Srikanth <[hidden email]> wrote: