Hi group, I want to bootstrap some aggregates based on historic data in S3
and then keep them updated based on a stream. To do this I was thinking of processing all of the historic data, taking a savepoint, and then restoring my program from that savepoint but with a stream source instead. Does this seem like a reasonable approach, or is there a better way to achieve this? There does not appear to be a straightforward way of doing it the way I was thinking, so any advice would be appreciated.
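A minimal sketch of this savepoint-and-switch approach, assuming a Flink DataStream job in Java. The --bootstrap flag, S3 path, Kafka settings, and uids below are all illustrative, and the Kafka consumer class name varies by connector version; the important part is giving the stateful operator a stable uid so its state can be re-attached after the source is swapped (restoring with --allowNonRestoredState so the retired source's state is ignored).

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class BootstrapJob {
    public static void main(String[] args) throws Exception {
        ParameterTool params = ParameterTool.fromArgs(args);
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines;
        if (params.getBoolean("bootstrap", false)) {
            // Run 1: bounded read of the historic data, then take a savepoint.
            lines = env.readTextFile("s3://my-bucket/historic/").uid("src-historic");
        } else {
            // Run 2: started from that savepoint (flink run -s <savepoint>
            // --allowNonRestoredState), now reading the live stream instead.
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            props.setProperty("group.id", "aggregates");
            lines = env.addSource(new FlinkKafkaConsumer<>(
                    "events", new SimpleStringSchema(), props)).uid("src-live");
        }

        lines.map(line -> Tuple2.of(line.split(",")[0], 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(value -> value.f0)
                .sum(1)
                // Stable uid: lets the restored savepoint re-attach this
                // operator's state after the source has been swapped.
                .uid("aggregator")
                .print();

        env.execute("bootstrap-aggregates");
    }
}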
Hi Gregory, I have run into a similar issue when dealing with historical data. We chose a Lambda architecture and worked out a use-case-specific hand-off protocol between the batch and streaming layers. Unless the storage side can support replaying logs within a time range, streaming application authors still need to do extra work to implement the batch layer. What we learned is that backfilling from historical log streams can be too expensive and inefficient for a streaming framework to handle, since streaming frameworks focus on optimizing for unbounded streams of unknown size.

Hope it helps.
Chen

On Thu, Jan 25, 2018 at 12:49 PM, Gregory Fee <[hidden email]> wrote:
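A minimal, framework-agnostic sketch in Java of the kind of hand-off protocol Chen describes; all class and field names here are illustrative. The batch layer publishes its aggregates together with a cutoff timestamp, and the streaming layer seeds its state from that snapshot and ignores events at or before the cutoff so nothing is double-counted.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Output of the batch layer: aggregates over the historic data plus the
// latest event time the batch covered.
final class BatchSnapshot {
    final Map<String, Long> countsByKey;
    final long cutoffMillis;

    BatchSnapshot(Map<String, Long> countsByKey, long cutoffMillis) {
        this.countsByKey = countsByKey;
        this.cutoffMillis = cutoffMillis;
    }
}

// Streaming layer: starts from the batch snapshot and applies only events
// newer than the cutoff.
final class StreamingAggregator {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    private final long cutoffMillis;

    StreamingAggregator(BatchSnapshot snapshot) {
        this.counts.putAll(snapshot.countsByKey); // seed state from the batch layer
        this.cutoffMillis = snapshot.cutoffMillis;
    }

    void onEvent(String key, long eventTimeMillis) {
        // Hand-off rule: the batch layer already covered everything up to
        // the cutoff, so only strictly newer events are applied.
        if (eventTimeMillis <= cutoffMillis) {
            return;
        }
        counts.merge(key, 1L, Long::sum);
    }

    long countFor(String key) {
        return counts.getOrDefault(key, 0L);
    }
}

The essential design choice is that the cutoff travels with the aggregates, so the two layers agree on exactly which events each one owns.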
Hi,
I see this coming up more and more often these days. For now, the solution of taking a savepoint and switching sources should work, but I've had it in my head for a while to add functionality for bootstrapping inputs to the API. An operator would first read from the bootstrap stream (which is finite) before switching over to reading from the other streams. The blocker for this is currently the network stack, because this behaviour can potentially lead to distributed deadlocks: you back-pressure the streams you're not yet reading from.

Best,
Aljoscha
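For illustration, one way to approximate that bootstrap-then-switch behaviour with today's API (this is a sketch of a workaround, not the proposed feature) is to connect the finite bootstrap stream with the live stream and buffer live events in state until the bootstrap side signals completion. The per-key END_OF_BOOTSTRAP sentinel is an assumption (each key would need its own end marker), and the fact that it buffers instead of back-pressuring is exactly the concern raised above.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class BootstrapThenLive extends RichCoFlatMapFunction<String, String, String> {
    private transient ValueState<Boolean> bootstrapDone;
    private transient ListState<String> bufferedLive;

    @Override
    public void open(Configuration parameters) {
        bootstrapDone = getRuntimeContext().getState(
                new ValueStateDescriptor<>("bootstrapDone", Boolean.class));
        bufferedLive = getRuntimeContext().getListState(
                new ListStateDescriptor<>("bufferedLive", String.class));
    }

    @Override
    public void flatMap1(String bootstrapRecord, Collector<String> out) throws Exception {
        if ("END_OF_BOOTSTRAP".equals(bootstrapRecord)) { // sentinel: an assumption
            bootstrapDone.update(true);
            for (String buffered : bufferedLive.get()) { // flush early arrivals
                out.collect(buffered);
            }
            bufferedLive.clear();
        } else {
            out.collect(bootstrapRecord); // historic record flows straight through
        }
    }

    @Override
    public void flatMap2(String liveRecord, Collector<String> out) throws Exception {
        if (Boolean.TRUE.equals(bootstrapDone.value())) {
            out.collect(liveRecord);
        } else {
            bufferedLive.add(liveRecord); // hold live events until bootstrap ends
        }
    }
}

It would be applied to connected keyed streams, e.g. bootstrap.connect(live).keyBy(k -> k, k -> k).flatMap(new BootstrapThenLive()). The buffered state can grow without bound while the bootstrap runs, which mirrors the back-pressure/deadlock problem a proper API-level solution would have to solve in the network stack.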