(DEPRECATED) Apache Flink User Mailing List archive.

design question

Chen Bekor

design question

Hi all,

I have a stream of incoming object versions (objects change over time) and a requirement to fetch from a datastore the last known object version in order to link it with the ID of the new version, so that I end up with a linked list of object versions.

All versions of an object share the same GUID, so I was thinking about using Flink streaming to ensure ordering and avoid concurrency issues / race conditions in the linkage process (object versions might arrive out of order or in spikes).

If I use the object GUID as the key for a keyed stream, I am concerned that I will end up with millions of windowed streams, causing an OOM.

What do you think is the right approach? Do you think Flink is the right technology for this task?

John Sherwood

Re: design question

This sounds like you have some per-key state to keep track of, so the 'correct' way to do it would be to keyBy the GUID. I believe that if you run your environment with the RocksDB state backend, you will not OOM regardless of how many GUIDs are eventually tracked. Whether Flink/stream processing is the most effective way to achieve your goal, I can't say, but I am fairly confident that this particular aspect is not a problem.

In reply to Chen Bekor's message of Sat, Apr 23, 2016 at 1:13 AM.
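
A minimal sketch of the per-key linking idea discussed here, written in plain Java with no Flink dependency so it can be run as-is. The class and field names (`VersionLinker`, `ObjectVersion`, `LinkedVersion`) are hypothetical, and the `HashMap` is only a stand-in for per-key state: in a Flink job, the assumption is that the stream would be keyed by GUID and the "last seen version ID" would live in keyed state rather than in an in-memory map. The sketch links versions in arrival order and does not attempt to handle out-of-order arrival.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VersionLinker {

    /** An incoming object version (hypothetical shape). */
    static final class ObjectVersion {
        final String guid;
        final String versionId;

        ObjectVersion(String guid, String versionId) {
            this.guid = guid;
            this.versionId = versionId;
        }
    }

    /** A version plus the ID of the version seen before it (null for the first one). */
    static final class LinkedVersion {
        final String guid;
        final String versionId;
        final String previousVersionId;

        LinkedVersion(String guid, String versionId, String previousVersionId) {
            this.guid = guid;
            this.versionId = versionId;
            this.previousVersionId = previousVersionId;
        }

        @Override
        public String toString() {
            return guid + "/" + versionId + " prev=" + previousVersionId;
        }
    }

    // Stand-in for per-key state: one "last seen version ID" slot per GUID.
    private final Map<String, String> lastVersionByGuid = new HashMap<>();

    /** Links a new version to the last version seen for the same GUID. */
    LinkedVersion link(ObjectVersion v) {
        // put() returns the previous value for the key, or null if there was none.
        String previous = lastVersionByGuid.put(v.guid, v.versionId);
        return new LinkedVersion(v.guid, v.versionId, previous);
    }

    public static void main(String[] args) {
        VersionLinker linker = new VersionLinker();
        List<ObjectVersion> input = Arrays.asList(
                new ObjectVersion("guid-a", "v1"),
                new ObjectVersion("guid-b", "v1"),
                new ObjectVersion("guid-a", "v2"));
        for (ObjectVersion v : input) {
            System.out.println(linker.link(v));
        }
    }
}
```

Only one small value is kept per GUID, which is why the number of distinct GUIDs affects the size of the state rather than the number of streams or windows.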

Chen Bekor

Re: design question

Cool. Can you point me to some docs on how to configure RocksDB? I searched the online docs and found nothing substantial. Also, if I'm using an HDFS (S3-backed) cluster, how would that affect RocksDB? Can I configure it to run on optimized SSDs, etc.?

Any help is appreciated.

In reply to John Sherwood's message of Sun, Apr 24, 2016 at 7:57 AM.

Aljoscha Krettek

Re: design question

Hi,

In the Flink docs there is this: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html#the-rocksdbstatebackend and this: RocksDBStateBackend

Cheers,

Aljoscha

In reply to Chen Bekor's message of Sun, 24 Apr 2016 at 21:58.