Hi group, I want to bootstrap some aggregates based on historic data in S3
and then keep them updated based on a stream. To do this I was thinking of processing all of the historic data, taking a savepoint, and then restoring my program from that savepoint but with a stream source instead. Does this seem like a reasonable approach, or is there a better way to achieve this? There does not appear to be a straightforward way of doing it the way I was thinking, so any advice would be appreciated.
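A minimal sketch of this savepoint-and-switch approach, assuming a Flink DataStream job in Java. The --bootstrap flag, S3 path, Kafka settings, and uids below are all illustrative, and the Kafka consumer class name varies by connector version; the important part is giving the stateful operator a stable uid so its state can be re-attached after the source is swapped (restoring with --allowNonRestoredState so the retired source's state is ignored).

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class BootstrapJob {
    public static void main(String[] args) throws Exception {
        ParameterTool params = ParameterTool.fromArgs(args);
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines;
        if (params.getBoolean("bootstrap", false)) {
            // Run 1: bounded read of the historic data, then take a savepoint.
            lines = env.readTextFile("s3://my-bucket/historic/").uid("src-historic");
        } else {
            // Run 2: started from that savepoint (flink run -s <savepoint>
            // --allowNonRestoredState), now reading the live stream instead.
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            props.setProperty("group.id", "aggregates");
            lines = env.addSource(new FlinkKafkaConsumer<>(
                    "events", new SimpleStringSchema(), props)).uid("src-live");
        }

        lines.map(line -> Tuple2.of(line.split(",")[0], 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(value -> value.f0)
                .sum(1)
                // Stable uid: lets the restored savepoint re-attach this
                // operator's state after the source has been swapped.
                .uid("aggregator")
                .print();

        env.execute("bootstrap-aggregates");
    }
}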
Hi Gregory, I have run into a similar issue when dealing with historical data. We chose a Lambda architecture and worked out a use-case-specific hand-off protocol between the batch and streaming layers. Unless the storage side can support replaying logs within a time range, streaming application authors still need to do extra work to implement the batch layer. What we learned is that backfilling from historical log streams can be too expensive and inefficient for a streaming framework to handle, since streaming frameworks focus on optimizing for unbounded streams of unknown size.

Hope it helps.
Chen

On Thu, Jan 25, 2018 at 12:49 PM, Gregory Fee <[hidden email]> wrote:
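A minimal, framework-agnostic sketch in Java of the kind of hand-off protocol Chen describes; all class and field names here are illustrative. The batch layer publishes its aggregates together with a cutoff timestamp, and the streaming layer seeds its state from that snapshot and ignores events at or before the cutoff so nothing is double-counted.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Output of the batch layer: aggregates over the historic data plus the
// latest event time the batch covered.
final class BatchSnapshot {
    final Map<String, Long> countsByKey;
    final long cutoffMillis;

    BatchSnapshot(Map<String, Long> countsByKey, long cutoffMillis) {
        this.countsByKey = countsByKey;
        this.cutoffMillis = cutoffMillis;
    }
}

// Streaming layer: starts from the batch snapshot and applies only events
// newer than the cutoff.
final class StreamingAggregator {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    private final long cutoffMillis;

    StreamingAggregator(BatchSnapshot snapshot) {
        this.counts.putAll(snapshot.countsByKey); // seed state from the batch layer
        this.cutoffMillis = snapshot.cutoffMillis;
    }

    void onEvent(String key, long eventTimeMillis) {
        // Hand-off rule: the batch layer already covered everything up to
        // the cutoff, so only strictly newer events are applied.
        if (eventTimeMillis <= cutoffMillis) {
            return;
        }
        counts.merge(key, 1L, Long::sum);
    }

    long countFor(String key) {
        return counts.getOrDefault(key, 0L);
    }
}

The essential design choice is that the cutoff travels with the aggregates, so the two layers agree on exactly which events each one owns.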
Hi,
I see this coming up more and more often these days. For now, the solution of taking a savepoint and switching sources should work, but I've had it in my head for a while to add functionality for bootstrapping inputs to the API. An operator would first read from the bootstrap stream (which is finite) before switching over to reading from the other streams. The blocker for this is currently the network stack, because this behaviour can potentially lead to distributed deadlocks: you back-pressure the streams you're not yet reading from.

Best,
Aljoscha
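For illustration, one way to approximate that bootstrap-then-switch behaviour with today's API (this is a sketch of a workaround, not the proposed feature) is to connect the finite bootstrap stream with the live stream and buffer live events in state until the bootstrap side signals completion. The per-key END_OF_BOOTSTRAP sentinel is an assumption (each key would need its own end marker), and the fact that it buffers instead of back-pressuring is exactly the concern raised above.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class BootstrapThenLive extends RichCoFlatMapFunction<String, String, String> {
    private transient ValueState<Boolean> bootstrapDone;
    private transient ListState<String> bufferedLive;

    @Override
    public void open(Configuration parameters) {
        bootstrapDone = getRuntimeContext().getState(
                new ValueStateDescriptor<>("bootstrapDone", Boolean.class));
        bufferedLive = getRuntimeContext().getListState(
                new ListStateDescriptor<>("bufferedLive", String.class));
    }

    @Override
    public void flatMap1(String bootstrapRecord, Collector<String> out) throws Exception {
        if ("END_OF_BOOTSTRAP".equals(bootstrapRecord)) { // sentinel: an assumption
            bootstrapDone.update(true);
            for (String buffered : bufferedLive.get()) { // flush early arrivals
                out.collect(buffered);
            }
            bufferedLive.clear();
        } else {
            out.collect(bootstrapRecord); // historic record flows straight through
        }
    }

    @Override
    public void flatMap2(String liveRecord, Collector<String> out) throws Exception {
        if (Boolean.TRUE.equals(bootstrapDone.value())) {
            out.collect(liveRecord);
        } else {
            bufferedLive.add(liveRecord); // hold live events until bootstrap ends
        }
    }
}

It would be applied to connected keyed streams, e.g. bootstrap.connect(live).keyBy(k -> k, k -> k).flatMap(new BootstrapThenLive()). The buffered state can grow without bound while the bootstrap runs, which mirrors the back-pressure/deadlock problem a proper API-level solution would have to solve in the network stack.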