Enriching data from external source with cache

Derek VerLee
My basic problem will probably sound familiar: I need to enrich incoming data using a REST call to an external system that holds slowly evolving metadata. Some cache-induced lag is acceptable, so to reduce load on the external system and to process more efficiently, I would like to implement a cache. The cache would be keyed, and I am already doing a keyBy on the same key in the job.

Please correct me if I'm wrong:
* Keyed State would be great for storing my metadata "cache", and Async I/O is ideal for pulling from the external system,
but an AsyncFunction cannot access keyed state ("Exception: State is not supported in rich async functions.") and operators cannot share state with each other.

This leaves me wondering, since side inputs are not here yet, what is the best (and perhaps most idiomatic) way to approach my problem?

I'd rather keep changes to existing systems minimal for this iteration and just minimize the impact on them during peaks as best I can. Systemic refactoring and re-architecture will be coming soon (so I'm happy to hear thoughts on that as well).

Approaches considered:

1. AsyncFunction with a transient Guava cache. Not ideal, but maybe good enough to get by.
2. Use compound message types (oh, if only Java had real algebraic data types...) and send cache-miss messages from some CacheEnrichmentMapper (keyed) to some AsyncCacheLoader (not keyed), which then feeds cache updates back to the former via an iteration. I don't know why this couldn't work, but it feels like a hot mess unless there is a cleaner way to do it that I am not thinking of.
3. One user on a similar thread mentioned loading the data in as another DataStream and then using joins, but I'm confused about how this would work. It seems to me that joins happen on windows, and windows pertain to (some notion of) time, so what would be my notion of time for the slow (maybe years old in some cases) metadata?
4. Forget about async I/O.
5. Implement my own "async I/O" using a process function or similar. Is this a valid pattern?
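
To make approach 2 concrete, here is a rough plain-Java sketch of the message flow only, with no Flink involved. All class names except CacheEnrichmentMapper are hypothetical, the maps merely stand in for what would be keyed state, and routing CacheMiss messages to the loader and feeding CacheUpdate messages back via the iteration is assumed to happen elsewhere. Cache expiry is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Compound message type: one stream carries records, cache misses and cache updates. */
abstract class Message {
    final String key;
    Message(String key) { this.key = key; }
}

final class DataRecord extends Message {
    final String payload;
    DataRecord(String key, String payload) { super(key); this.payload = payload; }
}

final class CacheMiss extends Message {
    CacheMiss(String key) { super(key); }
}

final class CacheUpdate extends Message {
    final String metadata;
    CacheUpdate(String key, String metadata) { super(key); this.metadata = metadata; }
}

final class Enriched extends Message {
    final String payload;
    final String metadata;
    Enriched(String key, String payload, String metadata) {
        super(key);
        this.payload = payload;
        this.metadata = metadata;
    }
}

/** Stand-in for the keyed operator; the two maps play the role of per-key state. */
class CacheEnrichmentMapper {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, Queue<DataRecord>> pending = new HashMap<>();

    /** Handles one incoming message and returns the messages the operator would emit. */
    List<Message> process(Message in) {
        List<Message> out = new ArrayList<>();
        if (in instanceof DataRecord) {
            DataRecord rec = (DataRecord) in;
            String meta = cache.get(rec.key);
            if (meta != null) {
                out.add(new Enriched(rec.key, rec.payload, meta));
            } else {
                Queue<DataRecord> waiting =
                        pending.computeIfAbsent(rec.key, k -> new ArrayDeque<>());
                if (waiting.isEmpty()) {
                    out.add(new CacheMiss(rec.key)); // only the first miss triggers a lookup
                }
                waiting.add(rec); // buffer until the update comes back
            }
        } else if (in instanceof CacheUpdate) {
            CacheUpdate upd = (CacheUpdate) in;
            cache.put(upd.key, upd.metadata);
            Queue<DataRecord> waiting = pending.remove(upd.key);
            if (waiting != null) {
                for (DataRecord rec : waiting) {
                    out.add(new Enriched(rec.key, rec.payload, upd.metadata));
                }
            }
        }
        return out;
    }
}
```

Records that arrive while a lookup is in flight are buffered and released in arrival order once the update for their key comes back, so only one lookup per key is outstanding at a time.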
Re: Enriching data from external source with cache

Timo Walther
Hi Derek,

maybe the following talk can give you some ideas on how to do this with
joins and async I/O: https://www.youtube.com/watch?v=Do7C4UJyWCM (around
the 17th minute). Basically, you split the stream and wait for the async
I/O result in a downstream operator.

But I think having a transient Guava cache is not a bad idea. Since it
is only a cache, it does not need to be checkpointed and can be rebuilt
at any time.
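
A minimal sketch of what such a transient cache could look like, in plain Java so that it runs without Flink or Guava on the classpath. The names AsyncTtlCache, loader and ttlMillis are made up for illustration. In a real job the cache would presumably be a transient field of the async function, created in open(), and a Guava cache could take the place of the hand-rolled map.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** In-memory TTL cache in front of an asynchronous lookup; nothing here is checkpointed. */
class AsyncTtlCache<K, V> {
    private static final class Entry<T> {
        final CompletableFuture<T> value;
        final long loadedAtMillis;
        Entry(CompletableFuture<T> value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<V>> loader;
    private final long ttlMillis;

    /** The loader should only start the request and must not touch this cache itself. */
    AsyncTtlCache(Function<K, CompletableFuture<V>> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    /**
     * Returns the cached future while it is younger than the TTL and has not failed;
     * otherwise starts a new lookup. Caching the future (not the value) means that
     * concurrent requests for the same key share one in-flight lookup.
     */
    CompletableFuture<V> get(K key, long nowMillis) {
        Entry<V> entry = entries.compute(key, (k, old) -> {
            boolean fresh = old != null
                    && nowMillis - old.loadedAtMillis < ttlMillis
                    && !old.value.isCompletedExceptionally();
            return fresh ? old : new Entry<V>(loader.apply(k), nowMillis);
        });
        return entry.value;
    }
}

class AsyncTtlCacheDemo {
    public static void main(String[] args) {
        AtomicInteger lookups = new AtomicInteger();
        AsyncTtlCache<String, String> cache = new AsyncTtlCache<String, String>(key -> {
            lookups.incrementAndGet(); // stands in for the REST call
            return CompletableFuture.supplyAsync(() -> "meta-" + key);
        }, 60_000L);

        long now = System.currentTimeMillis();
        cache.get("device-1", now).join();
        cache.get("device-1", now + 1_000L).join(); // within the TTL: served from the cache
        System.out.println("external lookups: " + lookups.get());
    }
}
```

After a restart the map simply starts out empty and fills up again from the external system, which is why nothing needs to go into a checkpoint. The price is one cache per parallel instance rather than per key, so the same key may be looked up by more than one instance unless the stream is keyed before the async operator.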

Implementing your own logic in a ProcessFunction is always an option,
but it might require more implementation effort.

By the way, if you feel brave enough, you could also think about
contributing a stateful async I/O. It should not be too much effort to
make this work.

Regards,
Timo



In reply to Derek VerLee's message of 9/29/17, 8:39 PM.


Re: Enriching data from external source with cache

Derek VerLee

Thanks Timo, watching the video now.

I did try out the method with iteration in a simple prototype, and it works. But you are right: combining it with the other requirements into a single process function has so far resulted in more complexity than I'd like, and it's always nice to leave behind something that is easily understood later.

On the contribution, I was wondering whether there is something scary going on with async and keyed state that has prevented this from happening already. I'll have a look and see if I can find where the current non-keyed implementation logic resides in the project.

Thanks

In reply to Timo Walther's message of 10/2/17, 6:07 AM.