Re: How to stream intermediate data that is stored in external storage?

Posted by Piotr Nowojski-3 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/How-to-stream-intermediate-data-that-is-stored-in-external-storage-tp30748p30778.html

Hi Kant,

Checkpointing interval is configurable, but I wouldn’t count on it working well with even 10s intervals. 
 
I think what you are this is not supported by Flink generically. Maybe Queryable state I mentioned before? But I have never used it.

I think you would have to implement your own custom operator that would output changes to it’s internal state as a side output.

Piotrek

On 30 Oct 2019, at 16:14, kant kodali <[hidden email]> wrote:

Hi Piotr,

I am talking about the internal state. How often this state gets checkpointed? if it is every few seconds then it may not meet our real-time requirement(sub second).

The question really is can I read this internal state in a streaming fashion in an update mode? The state processor API seems to expose DataSet but not DataStream so I am not sure how to read internal state in streaming fashion in an update made?

Thanks!

On Wed, Oct 30, 2019 at 7:25 AM Piotr Nowojski <[hidden email]> wrote:
Hi,

I’m not sure what are you trying to achieve. What do you mean by “state of full outer join”? The result of it? Or it’s internal state? Also keep in mind, that internal state of the operators in Flink is already snapshoted/written down to an external storage during checkpointing mechanism.

The result should be simple, just write it to some Sink.

For the internal state, it sounds like you are doing something not the way it was intended… having said that, you can try one of the following options:
a) Implement your own outer join operator (might not be as easy if you are using Table API/SQL) and just create a side output for the state changes.
b) Use state processor API to read the content of a savepoint/checkpoint [1][2]
c) Use queryable state [3] (I’m not sure about this, I have never used queryable state)

Piotrek

[3] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/queryable_state.html

On 29 Oct 2019, at 16:42, kant kodali <[hidden email]> wrote:

Hi All,

I want to do a full outer join on two streaming data sources and store the state of full outer join in some external storage like rocksdb or something else. And then want to use this intermediate state as a streaming source again, do some transformation and write it to some external store. is that possible with Flink 1.9?

Also what storage systems support push mechanism for the intermediate data? For example, In the use case above does rocksdb support push/emit events in a streaming fashion?

Thanks!